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Abstract. In this paper we develop new Newton and conjugate gradient algorithms on the Grassmann and Stiefel 
manifolds. These manifolds represent the constraints that arise in such areas as the symmetric eigenvalue problem, nonlinear 
eigenvalue problems, electronic structures computations, and signal processing. In addition to the new algorithms, we show 
how the geometrical framework gives penetrating new insights allowing us to create, understand, and compare algorithms. 
The theory proposed here provides a taxonomy for numerical linear algebra algorithms that provide a top level mathematical 
view of previously unrelated algorithms. It is our hope that developers of new algorithms and perturbation theories will 
benefit from the theory, methods, and examples in this paper. 
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1. Introduction. Problems on the Stiefel and Grassmann manifolds arise with sufficient frequency 
that a unifying investigation of algorithms designed to solve these problems is warranted. Understanding 
these manifolds, which represent orthogonality constraints (as in the symmetric eigenvalue problem), 
yields penetrating insight into many numerical algorithms and unifies seemingly unrelated ideas from 
different areas. 

The optimization community has long recognized that linear and quadratic constraints have special 
structure that can be exploited. The Stiefel and Grassmann manifolds also represent special constraints. 
The main contribution of this paper is a framework for algorithms involving these constraints, which draws 
upon ideas from numerical linear algebra, optimization, differential geometry, and has been inspired by 
certain problems posed in engineering, physics, and chemistry. Though we do review the necessary 
background for our intended audience, this is not a survey paper. This paper uses mathematics as a tool 
so that we can understand the deeper geometrical structure underlying algorithms. 

In our first concrete problem we minimize a function F{Y), where Y is constrained to the set of 
n-hy-p matrices such that = / (we call such matrices orthonormal), and we make the further 

homogeneity assumption that F{Y) — F{YQ), where Q is any p-hy-p orthogonal matrix. In other words, 
the objective function depends only on the subspace spanned by the columns of Y; it is invariant to any 
choice of basis. The set of p-dimensional subspaces in R" is called the Grassmann manifold. (Grassmann 
originally developed the idea in 1848, but his writing style was considered so obscure |l| that it was 
appreciated only many years later. One can find something of the original definition in his later work 



48 , Chap. 3, Sec. 1, Article 65].) To the best of our knowledge, the geometry of the Grassmann manifold 
has never been explored in the context of optimization algorithms, invariant subspace computations, 
physics computations, or subspace tracking. Useful ideas from these areas, however, may be put into the 
geometrical framework developed in this paper. 

In our second problem we minimize F{Y) without the homogeneity condition F{Y) = F{YQ) men- 
tioned above, i.e., the optimization problem is defined on the set of n-hy-p orthonormal matrices. This 
constraint surface is known as the Stiefel manifold, which is named for Eduard Stiefel, who considered 
its topology in the 1930s This is the same Stiefel who in collaboration with Magnus Hestenes in 

1952 originated the conjugate gradient algorithm Both Stiefel's manifold and his conjugate gradient 
algorithm play an important role in this paper. The geometry of the Stiefel manifold in the context of 
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optimization problems and subspace tracking was explored by Smith ||75| . In this paper we use numerical 
linear algebra techniques to simplify the ideas and algorithms presented there so that the differential geo- 
metric ideas seem natural and illuminating to the numerical linear algebra and optimization communities. 

The first author's original motivation for studying this problem came from a response to a linear 
algebra survey pO| |, which claimed to be using conjugate gradient to solve large dense eigenvalue problems. 
The second and third authors were motivated by two distinct engineering and physics applications. The 
salient question became: "What does it mean to use conjugate gradient to solve eigenvalue problems?" Is 
this the Lanczos method? As we shall describe, there are dozens of proposed variations on the conjugate 
gradient and Newton methods for eigenvalue and related problems, none of which are Lanczos. These 
algorithms are not all obviously related. The connections among these algorithms have apparently not 
been appreciated in the literature while in some cases numerical experiments have been the only basis for 
comparison when no theoretical understanding was available. The existence of so many variations in so 
many applications compelled us to ask for the big picture: What is the mathematics that unifies all of 
these apparently distinct algorithms. This paper contains our proposed unification. 

We summarize by itemizing what is new in this paper. 

1. Algorithms for Newton and conjugate gradient methods on the Grassmann and Stiefel manifolds 
that naturally use the geometry of these manifolds. In the special cases that we are aware of, our 
general algorithms are competitive up to small constant factors with the best known special algorithms. 
Conjugate gradient and Newton on the Grassmann manifold have never been studied before explicitly. 
Stiefel algorithms have been studied before [Q, but the ideas here represent considerable simplifications. 

2. A geometrical framework with the right mix of abstraction and concreteness to serve as a foun- 
dation for any numerical computation or algorithmic formulation involving orthogonality constraints, 
including the symmetric eigenvalue problem. We believe that this is a useful framework because it con- 
nects apparently unrelated ideas; it is simple and mathematically natural. The framework provides new 
insights into existing algorithms in numerical linear algebra, optimization, signal processing, and elec- 
tronic structures computations, and it suggests new algorithms. For example, we connect the ideas of 
geodesies and the cubic convergence of the Rayleigh quotient iteration, the CS decomposition, and se- 
quential quadratic programming. We also interpret the ill-conditioning of eigenvectors of a symmetric 
matrix with multiple eigenvalues as the singularity of Stiefel and Grassmann coordinates. 

3. Though geometrical descriptions of the Grassmann and Stiefel manifolds are available in many 
references, ours is the first to use methods from numerical linear algebra emphasizing computational 
efficiency of algorithms rather than abstract general settings. 

The remainder of this paper is organized into three sections. The geometrical ideas are developed 
in §|[ This §^ provides a self-contained introduction to geometry, which may not be familiar to some 
readers, while deriving the new geometrical formulas necessary for the algorithms of §^ and the insights 
of §^ provides descriptions of new algorithms for optimization on the Grassmann and Stiefel manifolds. 
Concrete examples of the new insights gained from this point of view are presented in Because we wish 
to discuss related literature in the context developed in §|| and §|[ we defer discussion of the literature to 
where specific applications of our theory are organized. 

2. DifTerential Geometric Foundation for Numerical Linear Algebra. A geometrical treat- 
ment of the Stiefel and Grassmann manifolds appropriate for numerical linear algebra cannot be found in 
standard differential geometry references. For example, what is typically required for practical conjugate 
gradient computations involving n-hy-p orthonormal matrices are algorithms with complexity of order np^. 
In this section we derive new formulas that may be used in algorithms of this complexity in terms of stan- 
dard operations from numerical linear algebra. These formulas will be used in the algorithms presented 
in the following section. Because we focus on computations, our approach differs from the more general 
(and powerful) coordinate- free methods used by modern geometers [|l8[ Q |6^, ^ Boothby ||] 
provides an undergraduate level introduction to the coordinate-free approach. 

For readers with a background in differential geometry, we wish to point out how we use extrinsic 
coordinates in a somewhat unusual way. Typically, one uses a parameterization of the manifold (e.g., 
X — cos M sin w, y = sin it sin w, z = cosv for the sphere) to derive metric coefficients and Christoffcl 
symbols in terms of the parameters {u and v). Instead, we only use extrinsic coordinates subject to 
constraints (e.g., {x,y,z) such that + + = 1). This represents points with more parameters 
than are intrinsically necessary, but we have found that the simplest (hence computationally most useful) 
formulas for the metric and Christoffel symbol are obtained in this manner. The choice of coordinates 
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Table 2.1 
Representations of subspace manifolds 



Space 


Symbol 


VIatrix 


Rep 


Quotient Rep. 


Orthogonal Group 


o„ 


Q 




Stiefel Manifold 






Y 




On / On — p 












( V„,p/Op 1 


Grassmann Manifold 


Gn.p 


None 














\ On/ {Op X On-p) J 



does not matter abstractly, but on a computer the correct choice is essential. 

We now outline this section. After defining the manifolds of interest to us in §p.l|, we take a close 



look at the Stiefel manifold as a submanifold of Euclidean space in §2.2. This introduces elementary ideas 
from differential geometry and provides the geometric structure of the orthogonal group (a special case of 
the Stiefel manifold), which will be used throughout the rest of the paper. However, the Euclidean metric 
is not natural for the Stiefel manifold, which inherits a canonical metric from its definition as a quotient 
space. Therefore, we introduce the quotient space point of view in §|]3. With this viewpoint, we then 



derive our formulas for geodesies and parallel translation for the Stiefel and Grassmann manifold in 92. 



and §2.5. Finally, we describe how to incorporate these formulae into conjugate gradient and Newton 



methods in §2.6 



2.1. Manifolds Arising in Numerical Linear Algebra. For simplicity of exposition, but for 
no fundamental reason, we will concentrate on real matrices. All ideas carry over naturally to complex 
matrices. Spaces of interest are 

1. The orthogonal group 0„ consisting of n-by-n orthogonal matrices 

2. The Stiefel manifold T^i p consisting of n-hy-p "tall-skinny" orthonormal matrices 

3. The Grassmann manifold Gn,p obtained by identifying those matrices in Vn^p whose columns 
span the same subspace (a quotient manifold). 



Table 2.1 summarizes the definitions of these spaces. Our description of Gn,p is necessarily more 
abstract than 0„ or Vn^p. Gn,p may be defined as the set of all p-dimensional subspaces of an n-dimensional 
space. 

We shall benefit from two different yet equivalent modes of describing our spaces: concrete represen- 
tations and quotient space representations. Table illustrates how we store elements of Vn^p and Gn.p 
in a computer. A point in the Stiefel manifold Vn,p is represented by an n-hy-p matrix. A point on the 
Grassmann manifold Gn^p is a linear subspace, which may be specified by an arbitrary orthogonal basis 
stored as an n-hy-p matrix. An important difference here is that, unlike points on the Stiefel manifold, 
the choice of matrix is not unique for points on the Grassmann manifold. 

The second mode of representation, the more mathematical, is useful for obtaining closed-form expres- 
sions for the geometrical objects of interest. It is also the "proper" theoretical setting for these manifolds. 
Here, we represent the manifolds as quotient spaces. Points in the Grassmann manifold are equivalence 
classes of n-by-p orthogonal matrices, where two matrices are equivalent if their columns span the same 
^i-dimensional subspace. Equivalently, two matrices are equivalent if they are related by right multiplica- 
tion of an orthogonal p-by-p matrix. Therefore, Gn,p = Vn,p/Op- On the computer, by necessity, we must 
pick a representative of the equivalence class to specify a point. 

The Stiefel manifold may also be defined as a quotient space but arising from the orthogonal group. 
Here we identify two orthogonal matrices if their first p columns are identical or, equivalently, if they are 
related by right multiplication of a matrix of the form where Q is an orthogonal {n — p)-hy-{n — p) 
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Table 2.2 

Computational representation of subspace manifolds 



Space 



Stiefel Manifold 



Data Structure Represents 



Tangents A 



Y 



one point y-^A = skew-symmetric 



Grassmann Manifold 



Y 



entire equivalence class 



y^A = o 



block. Therefore, Vn,p = On/On-p- With the Stiefel manifold so represented, one has yet another 
representation of the Grassmann manifold, Gn,p — On/{Op x On-p)- 

2.2. The Stiefel Manifold in Euclidean Space. The Stiefel manifold Vn,p may be embedded in 
the np dimensional Euclidean space of n-hy-p matrices. When p = 1, we simply have the sphere, while 
when p = n, we have the group of orthogonal matrices known as 0„. These two special cases are the 
easiest, and arise in numerical linear algebra the most often. 

Much of this section, which consists of three subsectio ns, is designed to be a painless and intuitive 
introduction to differential geometry in Euclidean space. §2.2.1 is elementary. It derives formulas for 
projections onto the tangent and normal spaces. In § 2.2. 2| , we derive formulas for geodesies on the Stiefel 
manifold in Euclidean space. We then discuss parallel translation in j |2.2.3 . 

In the two special cases when p = 1 and p = n, the Euclidean metric and the canonical metric to be 
discussed in §2.4 are the same. Otherwise they differ. 



2.2.1. Tangent and Normal Space 

to the submanifold at that point, as shown in Figure p.l 



Intuitively, the tangent space at a point is the plane tangent 
For d-dimensional manifolds, this plane is a 
d-dimensional vector space with origin at the point of tangency. The normal space is the orthogonal 
complement. On the sphere, tangents are perpendicular to radii, and the normal space is radial. In this 
subsection, we will derive the equations for the tangent and normal spaces on the Stiefel manifold. We 
also compute the projection operators onto these spaces. 




Fig. 2.1. The tangent and normal spaces of an embedded or constraint manifold. 



An equation defining tangents to the Stiefel manifold at a point Y is easily obtained by differentiating 
Y^Y = I, yielding Y^A + A^Y = 0, i.e., Y^A is skew-symmetric. This condition imposes p{p + l)/2 
constraints on A, or equivalently, the vector space of all tangent vectors A has dimension 

(2.1) np = \-p(n~p). 
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Both sides of Eq. ( ^.l| ) are useful for the dimension counting arguments that will be employed. 

The normal space is defined to be the orthogonal complement of the tangent space. Orthogonality 
depends upon the definition of an inner product, and because in this sub-section we view the Stiefel 
manifold as an embedded manifold in Euclidean space, we choose the standard inner product 

(2.2) ge(Ai,A2) =trAfA2, 

in np-dimensional Euclidean space (hence the subscript e), which is also the Frobenius inner product 
for n-by-p matrices. We shall also write (Ai, A2) for the inner product, which may or may not be the 
Euclidean one. The normal space at a point Y consists of all matrices N which satisfy 

tr A^iV = 

for all A in the tangent space. It follows that the normal space is p{p + l)/2 dimensional. It is easily 
verified that if iV = YS, where S is p-hy-p symmetric, then N is in the normal space. Since the dimension 
of the space of such matrices is p{p + l)/2, we see that the normal space is exactly the set of matrices 
{ YS }, where S is any p-hy-p symmetric matrix. 

Let Z be any n-hy-p matrix. Letting sym{A) denote {A + A^)/2 and skew(A) = {A — A'^)/2, it is 
easily verified that at Y 

(2.3) 7rA,(Z) = rsym(r^Z) 
defines a projection of Z onto the normal space. Similarly, at Y , 

(2.4) ttt{Z) ^ Y skew(y^Z) + (/ - YY'^)Z 

is a projection of Z onto the tangent space at Y (this is also true of the canonical metric to be discussed 



in §2.4). Eq. (2.4) suggests a form for the tangent space of Vn,p at Y that will prove to be particularly 



useful. Tangent directions A at F then have the general form, 

(2.5) A^YA + Y_lB 

(2.6) =YA + (I -YY'^)C 

where A is p-hy-p skew-symmetric, _B is (n — p)-hy-p, C is n-hy-p, B and C are both arbitrary, and 11 is 
any n-hy-{n~p) matrix such that YY"^ + Y\Y\^ = I; note that B = Y±^C. The entries in the matrices A 
and B parameterize the tangent space at Y with p(p — l)/2 degrees of freedom in A a nd p {n — p) degrees 
of freedom in B, resulting in p(p — l)/2 -I- p{n — p) degrees of freedom as seen in Eq. (|2.l| ). 

In the special case Y — In,p = (^q) (the first p columns of the n-hy-n identity matrix), called the 
origin, the tangent space at Y consists of those matrices 



X 



for which A is p-hy-p skew-symmetric and _B is (n — p)-hy-p arbitrary. 

2.2.2. Embedded Geodesies. A geodesic is the curve of shortest length between two points on a 
manifold. A straightforward exercise from the calculus of variations reveals that for the case of manifolds 
embedded in Euclidean space the acceleration vector at each point along a geodesic is normal to the 
submanifold so long as the curve is traced with uniform speed. This condition is necessary and sufficient. 
In the case of the sphere, acceleration for uniform motion on a great circle is directed radially and therefore 
normal to the surface; therefore, great circles are geodesies on the sphere. One may consider embedding 
manifolds in spaces with arbitrary metrics. See Spivak |7^, vol. 3, p. 4] for the appropriate generalization. 

Through Eq. ( |2.3| ) for the normal space to the Stiefel manifold, it is easily shown that the geodesic 
equation for a curve Y{t) on the Stiefel manifold is defined by the differential equation 

(2.7) Y + Y{Y'^Y) = Q. 

To see this, we begin with the condition that Y{t) remains on the Stiefel manifold, 

(2.8) Y^Y = Ip. 
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Taking two derivatives, 

(2.9) Y'^Y + 2r'^r + Y^Y = 0. 
To be a geodesic, Y{t) must be in the normal space at Y{t) so that 

(2.10) Y{t)+Y{t)S = Q 



for some symmetric matrix S. Substitute Eq. (2.10) into (2.9) to obtain the geodesic equation Eq. ( p.7[ ). 
Alternatively Eq. ( p. 7] ) could be obtained from the Euler-Lagrange equation for the calculus of variations 
problem 

(2.11) d{Yi,Y2) = min j \trY^Yf/^ dt such that Y{ti) = Fi, Y{t2) = I2. 
We identify three integrals of motion of the geodesic equation Eq. (2.7). Define 

(2.12) c = y^r, a = y'^y, s^y'^y. 

Directly from the geodesic equation Eq. (p?7|), 

C = A + A^, 
A = -CS + S, 
S=[A,S], 

where 

(2.13) [A,S]=AS-SA 

is the Lie bracket of two matrices. Under the initial conditions that Y is on the Stiefel manifold (C = /) 
and F is a tangent [A is skew-symmetric) then the integrals of the motion have the values 

C{t) = /, 

A{t) = A(0), 

S{t) = e^*5(0)e-^*. 

These integrals therefore identify a constant speed curve on the Stiefel manifold. In most differential 
geometry books, the equation of motion for geodesies is written in intrinsic coordinates in terms of so- 
called Christoffel symbols which specify a quadratic form of the tangent vectors. In our formulation, the 
form V^iY ,Y) = YY^Y is written compactly in extrinsic coordinates. 

With these constants of the motion, we can write an integrable equation for the final geodesic,^ 

V.-) . (ye- r.-) (f 'f> 

with integral 

y(t) = (r(o), y(o)) expt(^^ h,^,e-^'. 

This is an exact closed form expression for the geodesic on the Stiefel manifold, but we will not use 
this expression in our computation. Instead we will consider the non-Euclidean canonical metric on the 
Stiefel manifold in §2.4. 

We mention in the case of the orthogonal group {p = n), the geodesic equation is obtained simply 
from A = Q^Q — constant, yielding the simple solution 

(2.14) Q{t) = Q(0)e^*. 



From Eq. ( 2.14 ) it is straightforward to show that on connected components of 0„, 

1/2 

(2.15) d{Q,,Q2) = [y.(^l 

where {e^^*'} are the eigenvalues of the matrix Q\Q2 (cf. Eq. (2.67) and §4.3). 



^We thank Ross Lippert |56[ for this observation. 
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2.2.3. Parallel Translation. In Euclidean space, we move vectors parallel to themselves simply by 
moving the base of the arrow. On an embedded manifold, if we move a tangent vector to another point on 
the manifold by this technique, it is generally not a tangent vector. One can, however, transport tangents 
along paths on the manifold by infinitesimally removing the component of the transported vector in the 
normal space. 




Fig. 2.2. Parallel transport in a submanifold of Euclidean space (infinitesimal construction). 



Figure 2.2 illustrates the idea: Imagine moving a tangent vector A along the curve Y{t) in such 
a manner that every infinitesimal step consists of a parallel displacement of A in the Euclidean np 
dimensional space which is then followed by the removal of the normal component. If we move from 
y(0) = y to Y{e) then to first order, our new location is F + eY . The equation for infinitesimally 
removing the component generated in the normal space as we move in the direction Y is obtained by 



differentiating Eq. (2.3) 



(2.16) A -y(y^A + A^y)/2, 

We are unaware of any closed form solution to this system of differential equations along geodesies. 

By differentiation, we see that parallel transported vectors preserve the inner product. In particular, 
the square length of A (tr A-^A) is preserved. Additionally, inserting Y into the parallel transport equation, 
one quickly sees that a geodesic always parallel transports its own tangent vector. This condition may be 
taken as the definition of a geodesic. 

Observing that tr A-^A is the sum of the squares of the singular values of A, we conjectured that 
the individual singular values of A might also be preserved by parallel transport. Numerical experiments 
show that this is not the case. 

In the case of the orthogonal group {p = n), however, parallel translation of A along the geodesic 
Q{t) — Q{0)e^* is straightforward. Let A{t) = Q{t)B{t) be the solution of the parallel translation 
equation 

A = -g(o^A + A^g)/2, 

where B{t) is a skew-symmetric matrix. Substituting A = QB + QI3 and Q ~ QA, we obtain 

(2.17) B^-^[A,B] 

whose solution is B{t) = e~'^*/^i?(0)e"^*/^; therefore, 

(2.18) A{t) = Q(0)e^*/2B(0)e^*/2^ 



These formulas may be generalized to arbitrary connected Lie groups |47, Chap. 2, Ex. A. 6]. 

So as to arrive at the general notion of parallel transport, let us formalize what we did here. We saw 
that the geodesic equation may be written 



y + re(r,r) = 0, 
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where in the Euclidean case 

re(Ai, A2) = Y{^ll^2 + A^Ai)/2. 

Anticipating the generahzation, we interpret F as containing the information of the normal component 
that needs to be removed. Knowing the quadratic function r(A, A) is sufficient for obtaining the bilinear 
function r(Ai, A2); the process is called polarization. We assume that F is a symmetric function of its 
arguments (this is the so-called torsion- free condition), and we obtain 

4r(Ai,A2) -r(Ai + A2,Ai + A2) -r(Ai - A2,Ai - A2). 

For the cases we study in this paper, it is easy in practice to guess a symmetric form for r(Ai, A2) given 

r(A,A). 



We will give a specific example of why this formalism is needed in {2A. Let us mention here that the 
parallel transport defined in this manner is known to differential geometers as the Levi-Civita connection. 
We also remark that the function T when written in terms of components defines the Christoffel symbols. 
Switching to vector notation, in differential geometry texts the iih component of the function r(t;, w) 
would normally be written as ^^-^ V^^^VjiVk, where the constants F*^ are called Christoffel symbols. We 
prefer the matrix notation over the scalar notation. 

2.3. Geometry of Quotient Spaces. Given a manifold whose geometry is well understood (where 
there are closed form expressions for the geodesies and, perhaps also, parallel transport), there is a 
very natural, efficient, and convenient way to generate closed form formulas on quotient spaces of that 
manifold. This is precisely the situation with the Stiefel and Grassmann manifolds, which are quotient 
spaces within the orthogonal group. As just seen in the previous section, geodesies and parallel translation 
on the orthogonal group are simple. We now show how the Stiefel and Grassmann manifolds inherit this 
simple geometry. 

2.3.1. The Quotient Geometry of the Stiefel Manifold. The important ideas here are the 
notions of the horizontal and vertical spaces, the metric, and their relationship to geodesies and parallel 
translation. We use brackets to denote equivalence classes. We will define these concepts using the Stiefel 
manifold Vn,p — On/On-p as an example. The equivalence class [Q] is the set of all n-hy-n orthogonal 
matrices with the same first p columns as Q. A point in the Stiefel manifold is the equivalence class 

(2.19) [Q] - I g ( 0^ ) : Qn-p e 0„- 

that is, a point in the Stiefel manifold is a particular subset of the orthogonal matrices. Notice that in 
this section we are working with equivalence classes rather than n-hy-p matrices Y = QIn,p- 

The vertical and horizontal spaces at a point Q are complementary linear subspaces of the tangent 
space at Q. The vertical space is defined to be vectors tangent to the set [Q]. The horizontal space is 
defined as the tangent vectors at Q orthogonal to the vertical space. At a point Q, the vertical space is 
the set of vectors of the form 

(2.20) * = Q(o C 

where C is (n — p)-hy-{n — p) skew-symmetric, and we have hidden post-multiplication by the isotropy 



subgroup ij'' Q ). Such vectors are clearly tangent to the set [Q] defined in Eq. ( 2.19 ). It follows that 
the horizontal space at Q is the set of tangents of the form 

(2.21) ^=q{b 

(also hiding the isotropy subgroup), where A is p-hy-p skew-symmetric. Vectors of this form are clearly 
orthogonal to vertical vectors with respect to the Euclidean inner product. The matrices A and B of 
Eq. ( ^.21 ) are equivalent to those of Eq. ( |2.5| ). 



The significance of the horizontal space is that it provides a representation of tangents to the quotient 
space. Intuitively, movements in the vertical direction make no change in the quotient space. Therefore, 
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the metric, geodesies, and parallel translation must all be restricted to the horizontal space. A rigorous 
treatment of these intuitive concepts is given by Kobayashi and Nomizu and Chavel . 

The canonical metric on the Stiefel manifold is then simply the restriction of the orthogonal group 
metric to the horizontal space (multiplied by 1/2 to avoid factors of 2 later on). That is, for Ai and A2 
of the form in Eq. (|2.2l|). 



(2.22) 



5c(Ai,A2) 



^trAjA2+trBfB2 



Q 



A2 
B2 



which we shall also write as (Ai, A2). It is important to realize that this is not equal to the Euclidean 
metric defined in §2.2 (except for p = 1 or n), even though we use the Euclidean metric for the orthog- 
onal group in its definition. The difference arises because the Euclidean metric counts the independent 
coordinates of the skew-symmetric A matrix twice and those of B only once, whereas the canonical metric 



counts all independent coordinates in A and B equally. This point is discussed in detail in §2.4 
Notice that the orthogonal group geodesic 



(2.23) 

has horizontal tangent 
(2.24) 



Q{t) = Q(0)expt 



A -B^ 
B 



Q{t) = Q{t) 



A ^B^ 
B 



at every point along the curve Q{t). Therefore they are curves of shortest length in the quotient space as 
well, i.e., geodesies in the Grassmann manifold are given by the simple formula 



(2.25) 



Stiefel geodesies — [Q[t)\, 



where [Q{t)] is given by Eqs. ( 2.19 ) and ( 2.23 ). This formula will be essential for deriving an expression 
for geodesies on the Stiefel manifold using n-hy-p matrices in § ^.4| . 

In a quotient space, parallel translation works in a way similar to the embedded parallel translation 
discussed in § 2.2.3 , Parallel translation along a curve (with everywhere horizontal tangent) is accomplished 
by infinitesimally removing the vertical component of the tangent vector. The equation for parallel 
translation along the geodesies in the Stiefel manifold is obtained by applying this idea to Eq. (2.17), 
which provides translation along geodesies for the orthogonal group. Let 



(2.26) 



A = 



Bi 



and B = 



A2 
B2 



be two horizontal vectors at Q 
differential equation 



/. The parallel translation of B along the geodesic e is given by the 



(2.27) 



B 



where the subscript H denotes the horizontal component (lower right block set to zero). Note that the 
Lie bracket of two horizontal vectors is not horizontal, and that the solution to Eq. (2.27) is not given by 
the formula(e~^*/^B(0)e^*/^)H- This is a special case of the general formula for reductive homogeneous 
spaces jl^, ^ . This first order linear differential equation with constant coefficients is integrable in closed 
form, but it is an open question whether this can be accomplished with 0{np^) operations. 

2.3.2. The Quotient Geometry of the Grassmann Manifold. We quickly repeat this approach 
for the Grassmann manifold Gn,p = On/ {Op x On-p)- The equivalence class [Q] is the set of all orthogonal 
matrices whose first p columns span the same subspace as those of Q. A point in the Grassmann manifold 
is the equivalence class 



(2.28) 



Q 







Qp 

Qn—p 



^p G Op-, Qn—p G 0',i—p 
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i.e., a point in the Grassmann manifold is a particular subset of the orthogonal matrices, and the Grass- 
mann manifold itself is the collection of all these subsets. 

The vertical space at a point Q is the set of vectors of the form 



(2.29) 



$ = Q 



A 
c 



where A is p-hy-p skew-symmetric and C is {n — p)-\yy-(n — p) skew-symmetric. The horizontal space at 
Q is the set of matrices of the form 



(2.30) 



-B^ 
B 



in Eqs. ( |2.29| ) and 



Note that we have hidden post-multiplication by the isotropy subgroup { q 

The canonical metric on the Grassmann manifold is the restriction of the orthogonal group metric to 
the horizontal space (multiplied by 1/2). Let Ai and A2 be of the form in Eq. (2.30). Then 

(2.31) (7,(Ai,A2) = trBfB2. 

As opposed to the canonical metric for the Stiefel manifold, this metric is in fact equivalent to the 
Euclidean metric (up to multiplication by 1/2) defined in Eq. (2.2). 
The orthogonal group geodesic 



(2.32) 

has horizontal tangent 
(2.33) 



Q{t) = Q(0)expt 







Q{t) = Q{t) 



-B^ 
B 



at every point along the curve Q{t); therefore, 

(2.34) Grassmann geodesies = [(3(t)], 



where [Qit)] is given by Eqs. ( 2.2S|) and (2.32). This formula gives us an easy method for computing 



geodesies on the Grassmann manifold using n-hy-p matrices, as will be seen in §2.5 



The method for parallel translation along geodesies in the Grassmann manifold is the same as for the 
Stiefel manifold, although it turns out the Grassmann manifold has additional structure that makes this 
task easier. Let 



(2.35) 



A = 



-A^ 




and B = 



-B^ 




be two horiz onta l vectors at Q = /. It is easily ve rified that [A, B] is in fact a vertical vector of the 
form of Eq. (2.29). If the vertical component of Eq. (2.17) is infinitesimally removed, we are left with the 
trivial differential equation 



(2.36) 



B = 0. 



Therefore, the parallel translation of the tangent vector (5(0)B along the geodesic Q{t) ~ (3(0)e^* is 
simply given by the expression 



(2.37) 



TB{t) = Q(0)e'^*B, 



which is of course horizontal at Q{t). Here we introduce the notation r to indicate the transport of a 
vector; it is not a scalar multiple of the vector. It will be seen in § ^.5| how this formula may be computed 
using 0{np^) operations. 
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As an aside, if H and V represent the horizontal and vertical spaces, respectively, it may be verified 



that 



(2.38) 



[V,V]CV, [V,H]CH, [H,H]CV. 



The first relationship follows from the fact that is a Lie algebra, the second follows from the reductive 
homogeneous space structure of the Grassmann manifold, also possessed by the Stiefcl manifold, and 
the third follows the symmetric space structure j47[ |5^ of the Grassmann manifold, which the Stiefel 
manifold does not possess. 

2.4. The Stiefel Manifold with its Canonical Metric. 

2.4.1. The Canonical Metric (Stiefel). The Euclidean metric 

ge(A,A) =trA^A 



used in §2.2 may seem natural, but one reasonable objection to its use is that it weighs the independent 
degrees of freedom of the tangent vector unequally. Using the representation of tangent vectors A = 
YA + YiB given in Eq. (|2^), it is seen that 

ge(A,A) =trA'^A + trB'^B. 



The Euclidean metric counts the p(p + l)/2 independent coordinates of A twice. At the origin In,p, a 
more equitable metric would be gc(A, A) = tr A-^(/ — ij„^p/Jp)A = i tr A'^A + tr B^B. To be equitable 
at all points in the manifold, the metric must vary with Y according to 



(2.39) 



5,(A,A) =trA^(/- iry^)A. 



This is called the canonical metric on the S tiefel manifold. This is precisely the metric derived from the 
quotient space structure of Vn.p i n Eq. ( 2.23 ); therefore, the formulas for geodesies and parallel translation 
for the Stiefel manifold given in 2.3.1 are corr ect if we view the Stiefel manifold as the set of orthonormal 
n-hy-p matrices with the metric of Eq. ( 2.39 ). Note that if A YA + Y^B is a tangent vector, then 
gc(A, A) = i tr A^A + tr B^B, as seen previously. 



2.4.2. Geodesies (Stiefel). The path length 



(2.40) 



L = I g,{Y,YY'^dt 



may be minimized with the calculus of variations. Doing so is tedious but yields the new geodesic equation 
(2.41) Y + YY^Y + Y{{Y^Yf +Y^Y) ^Q, 



Direct substitution into Eq. (2.41) using the fact that 
if A" is a skew-symmetric matrix of the form 

A -B^' 



X 



B 



verifies that the paths of the form 
(2.42) 



are closed form solutions to the geodesic equation for the canonical metric. 
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We now turn to the problem of computing geodesies with algorithms of complexity O(np^). Our 

current formula Y(t) = Qexpi ^^n.p for a geodesic is not useful. Rather we want to express 

the geodesic Y{t) in terms of the current position Y{0) = Y and a direction 1^(0) = H. For example, 
A = Y^H and we have C := B^B = H^{I — YY^)H. In fact the geodesic only depends on B^B rather 

than B itself. The trick is to find a differential equation for M{t) — I^^expt (^^ q 

The following theorem makes clear that the computational difficulty inherent in computing the 
geodesic is the solution of a constant coefficient second order differential equation for M{t). The an- 
swer is obtained not by a differential equation solver but rather by solving the corresponding quadratic 
eigenvalue problem: ^ 

Theorem 2.1. IfY{t) = Qe^i^ "° )/„,p, with Y{0) = Y and Y{0) = H, then 



(2.43) Y{t) = YM{t) + {I - YY' )H / M{t) dt, 

Jo 

where M{t) is the solution to the second order differential equation with constant coefficients, 



(2.44) M-AM + CM^O; Af (0) = /p, M{0) = A, 

A = Y^H is skew -symmetric, and C = H^{I — YY^)H is non-negative definite. 

Proof. A direct computation verifies t hat M{t) satisfies Eq. ( ^.44| ). By separately considering Y'^Y{f) 
and (/ - YY'^)Y{t), we may derive Eq. (|1|). □ 



The solution of the differential equation Eq. ( 2.44 ) may be obtained |2^, |8^ by solving the quadratic 
eigenvalue problem 

(A^/ -AX + C)x = 0. 

Such problems are typically solved in one of three ways: (1) by solving the generalized eigenvalue problem 

C 0\ f x\ f A -I\ f X 
l)\\x)^\I Q )\\x 

(2) by solving the eigenvalue problem 

-C A)\\x) \Xx 
or (3) any equivalent problem obtained by factoring C = K^K and then solving the eigenvalue problem 

A ~K^\ fx\ , fx" 



K ) \y ) ^ \y 

Problems of this form arise frequently in mechanics, usually with A symmetric. Some discussion of 
physical interpretations for skew-symmetric matrices may be found in the context of rotating machinery 
If X is the p-hy-2p matrix of eigenvectors and A denotes the eigenvalues, then M{t) = Xe^^Z, and 
its integral is / M{t) dt = Xe^*h~^Z, where Z is chosen so that XZ = I and XKZ = A. 
Alternatively, the third method along with the matrix exponential may be employed: 
Corollary 2.2. Let Y and H he n-by-p matrices such that Y^Y — Ip and A = Y^H is skew- 
symmetric. Then the geodesic on the Stiefel manifold emanating from Y in direction H is given by the 
curve 

(2.45) Y{t) ^YM{t) + QN{t), 
where 

(2.46) QR. := K ={I - YY^)H 

is the compact QR- decomposition of K (Q n-by-p, R p-by-p), and M{t) and N{t) are p-by-p matrices 
given by the matrix exponential 
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Note that Eq. (2.47) is easily computed by solving a 2p-hy-2p skew-symmetric eigenvalue problem, 
which can be accomplished efficiently using the SVD or algorithms specially tailored for this problem [p6| . 



2.4.3. Parallel Translation (Stiefel). We now develop a notion of parallel transport that is con- 
siste nt wi th the canonical metric. The geodesic equation takes the form Y + T(Y,Y) = 0, where, from 
Eq. ( ^.41 ), it is seen that the Christoffel function for the canonical metric is 



(2.48) 



r,(A, A) = AA'^y + YA^{I - YY^)A. 



By polarizing we obtain the result 



(2.49) 



r,(Ai, A2) = i(Ai Af + A2Af )y + iF(A^(/ - YY^)A, 

+Af(/-ry^)A2). 



Parallel transport is given by the differential equation 



(2.50) 



A + rc(A,r) = 0, 



which is equivalent to Eq. ( 2.27 ). As stated after this equation, we do not have an 0{np^) method to 
compute A(t). 



2.4.4. The Gradient of a Function (Stiefel). Both conjugate gradient and Newton's method 
require a computation of the gradient of a function, which depends upon the choice of metric. For a 
function F{Y) defined on the Stiefel manifold, the gradient of F at y is defined to be the tangent vector 
VF such that 



(2.51) 



trFfA = 5c(VF, A) = tr(VF)^(/ - \YY'^)A 



for all tangent vectors A at Y , where Fy is the n-hy-p matrix of partial derivatives of F with respect to 
the elements of Y , i.e.. 



(2.52) 



OF 
dY~ 



Solving Eq. (2.51) for VF such that Y {VF) = skew-symmetric yields 



(2.53) 



VF = Fy - YF^Y. 



Eq. (2.5E) may also be derived by differentiating F{Y{t))^ where Y{t) is the Stiefel geodesic given by 
Eq. (215). 
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2.4.5. The Hessian of a Function (Stiefel). Newton's method requires the Hessian of a function, 
which depends upon the choice of metric. The Hessian of a function F{Y) defined on the Stiefel manifold 
is defined as the quadratic form 



(2.54) HessF(A,A)= |j 



F{Y{t)), 



t=o 



where Y(t) is a geodesic with tangent A, i.e., Y{0) = A. Applying this definition to F{Y) and Eq. ( 2.45| ) 
yields the formula 

(2.55) HessF(Ai, A2) = Fyy(Ai, A2) + i tr((i^f AiF^ + Y^AiF^)A2) 

^^tr{{Y^Fy+F^Y)A{llA,), 



where H = / — YY'^ , Fy is defined in Eq. (2.52), and the notation Fyy(Ai, A2) denotes the scalar 
Ey,fci {FYY)ijM (^1 )y i^2)ki , where 

d^F 

(2.56) {FYYhj.ki 



dY,,dYki 

This formula may also readily be obtained by using Eq. ( 2.50| ) and the formula 



(2.57) HessF(Ai, A2) = Fyy(Ai, A2) - tr r,(Ai, A2). 
For Newton's method, we must determine the tangent vector A such that 

(2.58) Hess F(A, X) = (-G, X) for all tangent vectors X, 

where G ~ VF. Recall that ( , ) = gd , ) in this context. We shall express the solution to this linear 
equation as A = — Hess~^ G, which may be expressed as the solution to the linear problem 

(2.59) Fyy(A) - y skew(FfA) - skew{AF^)Y - ^HAF'^Fy = -G, 

Y'^A = skew-symmetric, where skew(X) — {X — X^)/2 and the notation Fyy(A) means the unique 
tangent vector satisfying the equation 

(2.60) Fyy (A, X) = {Fyy{A),X) for all tangent vectors X. 
Example problems are considered in 

2.5. The Grassmann Manifold with its Canonical Metric. A quotient space representation 
of the Grassmann manifold was given in j: 2.3.2| ; however, for computations we prefer to work with n-hy-p 



orthonormal matrices Y. When performing computations on the Grassmann manifold, we will use the 
n-hy-p matrix Y to represent an entire equivalence class 

(2.61) [Y] = {YQp:QpeOp}, 

i.e., the subspace spanned by the columns of Y . Any representative of the equivalence class will do. 

We remark that an alternative strategy is to represent points on the Grassmann manifold with pro- 
jection matrices YY^ . There is one unique such matrix corresponding to each point on the Grassmann 
manifold. On first thought it may seem foolish to use parameters to represent a point on the Grassmann 
manifold (which has dimension p(n—p)), but in certain ab initio physics computations ]4^ , the projection 
matrices YY^ that arise in practice tend to require only 0{n) parameters for their representation. 
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Returning to the n-hy-p representation of points on the Grassmann manifold, the tangent space is 
easily computed by viewing the Grassmann manifold as the quotient space Gn,p = Vn^p/Op. At a point 
Y on the Stiefel manifold then, as seen in Eq. (2.5), tangent vectors take the form A = Y A + Y±B , where 
A is p-hy-p skew-symmetric, _B is (n — p)-hy-p, and Yl is any n-by-(n — p) matrix such that {Y, Yl) is 
orthogonal. From Eq. (2.61) it is clear that the vertical space at Y is the set of vectors of the form 



(2.62) 



$ = YA\ 



therefore, the horizontal space at Y is the set of vectors of the form 
(2.63) A = iiS. 

Because the horizontal space is equivalent to the tangent space of the quotient, the tangent space of the 
Grassmann manifold at \Y] is given by all n-hy-p matrices A of the form in Eq. (2.63) or, equivalently, 
all n-by-p matrices A such that 



(2.64) 



y'^A = 0. 



Physically, this correspo nds t o directions free of rotations mixing the basis given by the columns of Y . 

We already saw in i j2.3.2 that the Euclidean metric is in fact equivalent to the canonical metric for 
the Grassmann manifold. That is, for n-hy-p matrices Ai and A2 such that Y'^/S.i = (i = 1, 2), 

5,(Ai,A2)=trAf(/-iyr^)A2, 
= tr AfA2, 

= (/e(Ai,A2). 

2.5.1. Ge odes ies (Grassmann). A formula for geodesies on the Grassmann manifold was 
given via Eq. (2.32); the following theorem provides a useful method for computing this formula using 
n-hy-p matrices. ^ 

Theorem 2.3. IfY{t) = Qe*(B )/„,p, with Y{Q) = Y and r(0) = H, then 



(2.65) 



Y{t) = {YV U)['^^^]V^^ 



where UYiV"^ is the compact singular value decomposition of H. 

Proof 1. It is easy to check that either formulation for the geodesic satisfies the geodesic equation 
Y + Y{Y'^Y) — 0, with the same initial conditions. □ 

Proof 2. Lets = (Ui, J72) (g)^"^ be the singular value decomposition of _B [Ui n-hy-p^ U2 p-hy-{n—p), 
S and V p-hy-p) . A straightforward computation involving the partitioned matrix 



(2.66) 







V 

Ui U2 





verifies the theorem. □ 

A subtle point in Eq. ( ^.65 ) is that if the rightmost V'^ is omitted, then we still have a representative 
of the same equivalence class as Y{t); however, due to consistency conditions along the equivalent class 
[y(t)], the tangent (horizontal) vectors that we use for computations must be altered in the same way. 
This amounts to post-multiplying everything by V, or for that matter, any p-hy-p orthogonal matrix. 

The path length between Yq and Y{t) (distance between subspaces) is given by Q 



(2.67) 



d{Y{t),Yo) =t\\H\\i 



i=l 



1/2 



where ct^ are the diagonal elements of S. (Actually, this is only true for t small enough to avoid the issue 
of conjugate points, e.g., long great circle routes on the sphere.) An interpretation of this formula in 
terms of the CS decomposition and principal angles between subspaces is given in §4.3. 
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2.5.2. Parallel Translation (Grassmann). A formula for parallel translation along geodesies of 
complexity 0{np^) can also be derived: 

Theorem 2.4. Let H and A be tangent vectors to the Grassmann manifold at Y . Then the parallel 
translation of A along the geodesic in the direction Y{0) = H [Eq. (2.6L )] is 



(2.68) 



TA(t) ^[{YV U) 



— sin Si 
cos Yit 



+{I- UU' ) A 



A simple computation verifies that Eqs. ( 2.68 ) and ( 2.65 ) satisfy 



Proof 1. 
Eq. ( ^lel ). □ 

Proof 2. Parallel translation of A is given by the expression 



rA(t) = Qexpi 







(which follows from Eq. (^.37D), where Q = (F, li), H = 11 A, and A = Y^B. Decomposing ( ° ) 
as in Eq. ( 2.66 ) (note well that A has replaced B), a straightforward computation verifies the theorem. 
□ 

2.5.3. The Gradient of a Function (Grassmann ). W e must compute the gradient of a function 
F{Y) defined on the Grassmann manifold. Similarly to § 2.4.4 , the gradient of F at \Y] is defined to be 
the tangent vector VF such that 



(2.69) 



trF^A = griVF, A) = tr{VFfA 



for all tangent vectors A at Y, where Fy is defined by Eq. ( 2.52| ). Solving Eq. (2.69) for VF such that 
yr(VF) = yields 



(2.70) 



VF = Fy -YY^Fy. 



Eq. ( |2.7ClD m ay also be derived by differentiating F(Y{t)), where Y{t) is the Grassmann geodesic given 
by EqTWM). 

2.5.4. T he H essian of a Function (Grassmann). Applying the definition for the Hessian oi F(Y) 
given by Eq. ( 2.54 ) in the context of the Grassmann manifold yields the formula 



(2.71) 



HessF(Ai,A2) =Fyy(Ai,A2) -tr(Af AzF^Fy) 



where Fy and Fyy are defined in § 2.4.5 . For Newton's method, we must determine A = — Hess 
satisfying Eq. (2.58), which for the Grassmann manifold is expressed as the linear problem 



(2.72) 



Fyy{A) - A(Y^Fy) = -G, 



Y^A — 0, where Fyy ( A) denotes the unique tangent vector satisfying Eq. ( ^.60 ) for the Grassmann 
manifold's canonical metric. 

Example problems are considered in 93. 



2.6. Conjugate Gradient on Riemannian Manifolds. As demonstrated by Smith |7q, 76 



the 

benefits of using the conjugate gradient algorithm for unconstrained minimization can be carried over to 
minimization problems constrained to Riemannian manifolds by a covariant translation of the familiar 
operations of computing gradients, performing line searches, the computation of Hessians, and carrying 
vector information from step to step in the minimization process. In this section we will review the 
ideas in ||7^, |7^ and then in the next section we formulate concrete algorithms for conjugate gradient on 
the Stiefel and Grassmann manifolds. Here one can see how the geometry provides insight into the true 
difference among the various formulas that are used in linear and nonlinear conjugate gradient algorithms. 
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Figure 3.1 sketches the conjugate gradient algorithm in flat space and Figure 3.2 illustrates the 
algorithm on a curved space. An outline for the iterative part of the algorithm (in either flat or curved 
space) goes as follows: at the (fc — l)st iterate Xk-i, step to Xk, the minimum of / along the geodesic in 
the direction Hk-i, compute the gradient Gk = V/(a;fe) at this point, choose the new search direction to 
be a combination of the old search direction and the new gradient: 

(2.73) Hk = Gk+-fkTHk-u 



and iterate until convergence. Note that rHk-i in Eq. ( 2.73 ) is the parallel translation of the vector 
Hk-i defined in §2.2.3, which in this case is simply the direction of the geodesic (line) at the point Xk 
(see Figure 3.2). Also note the important condition that Xk is a minimum point along the geodesic: 

(2.74) {Gk,THk-i) = 0. 

Let us begin our examination of the choice of jk in fiat space before proceeding to arbitrary manifolds. 
Here parallel transport is trivial so that 

Hk ^ Gk+lkHk^i. 

In both linear and an idealized version of nonlinear conjugate gradient, 7^ may be determined by the 
exact conjugacy condition for the new search direction: 

fxx{Hk, Hk-i) = 0, 

i.e., the old and new search direction must be conjugate with respect to the Hessian of /. (With f^x = A, 
the common notation page 523] for the conjugacy condition is p'^_iApk = 0.) The formula for 7^ is 
then 

(2.75) Exact Conjugacy: jk ^ ~fxx{Gk, Hk-i)/ fxx{Hk-i, Hk-i). 

The standard trick to improve the computational efficiency of linear conjugate gradient is to use a 
formula relating a finite difference of gradients to the Hessian times the direction (r^ — rk-i = —akApk 
as in |4^). In our notation, 

(2.76) {Gk - Gfc_i, •) « a/,,(-, Uk-i). 
where a = \xk - 

The formula is exact for linear conjugate gradient on flat space, otherwise it has the usual error in finite 
difference approximations. By applying the finite difference formula Eq. ( 2.76| ) in both the numerator 
and denominator of Eq. (2.75), and also applying Eq. ( ^.74 ) twice (once with k and once with fc — 1), one 
obtains the formula 

(2.77) Polak-Ribiere: 7^ = {Gk - Gk-i,Gk) / {Gk-i,Gk-i) . 

Therefore the Polak-Ribiere formula is the exact formula for conjugacy through the Hessian, where one 
uses a difference of gradients as a finite difference approximation to the second derivative. If f{x) is well 
approximated by a quadratic function, then (Gfc_i, Gk) ~ 0, and we obtain 

(2.78) Fletcher-Reeves: -fk ^ {Gk,Gk}/{Gk-i,Gk-i}. 

For arbitrary manifolds, the Hessian is the second derivative along geodesies. In differential geometry 
it is the second covariant differential of /. Here are the formulas: 

(2.79) Exact Conjugacy: jk = - Hess/(G'fe, TiJfe_i)/ Hess/(Ti?fe_i, TiJfe_i) 

(2.80) Polak-Ribiere: 7^ = {Gk - TGk-i,Gk) / {Gk-i,Gk-i) 

(2.81) Fletcher-Reeves: jk = {Gk,Gk)/{Gk-uGk-i) 

which may be derived from the finite difference approximation to the Hessian, 

{Gk - rGfc^i, •) « aHess/(-,TiJfe_i), a = d{xk, Xk^i)/\\Hk-i\\. 
Asymptotic analyses appear in §B^. 
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3. Geometric Optimization Algorithms. The algorithms presented here are our answer to the 
question: "What does it mean to perform the Newton and conjugate gradient methods on the Stiefel and 
Grassmann manifolds?" Though these algorithms are idealized, they are of identical complexity up to 
small constant factors with the best known algorithms. In particular, no differential equation routines are 
used. It is our hope that in the geometrical algorithms presented here, the reader will recognize elements 
of any algorithm that accounts for orthogonality constraints. These algorithms are special cases of the 
Newton and conjugate gradient methods on general Riemannian manifolds. If the objective function is 
nondegenerate, then the algorithms are guaranteed to converge quadratically [76[ . 




Fig. 3.1. Conjugate gradient in flat space. 




Fig. 3.2. Conjugate gradient in curved space. 
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3.1. Newton's Method on the Grassmann Manifold. In flat space, Newton's method simply 
updates a vector by subtracting the gradient vector pre-multipUed by the inverse of the Hessian. The 
same is true on the Grassmann manifold (or any Riemannian manifold for that matter) of p-planes in 
n-dimensions, with interesting modifications. Subtraction is replaced by following a geodesic path. The 
gradient is the usual one (which must be tangent to the constraint surface), and the Hessian is obtained 
by twice differentiating the function along a geodesic. We show in §^!^ that this Hessian is related to the 
Hessian of the Lagrangian; the two Hessians arise from the difference between the intrinsic and extrinsic 
viewpoints. It may be suspected that following geodesies may not be computationally feasible, but because 
we exploit the structure of the constraint surface, this operation costs 0{np^), which is required even for 
traditional algorithms for the eigenvalue problem — our simplest example. 

Let F{Y) be a smooth function on the Grassmann manifold, i.e., F{Y) = F{YQ) for any p-hy-p 
orthogonal matrix Q, where Y is an n-by-p matrix such that Y^'^Y = Ip. We compute formulas for 



Fy and iVy(A) using the definitions given in §2.5.4. Newton's method for minimizing F{Y) on the 
Grassmann manifold is: 



Newton's Method for Minimizing F{Y) on the Grassmann Manifold 

• Given Y such that Y^Y ^ Ip, 

o Compute G = Fy - YY^Fy- 

o Compute A = — Hess^^ G such that Y'^A = and 

Fyy{A) - A{Y^Fy) = -G. 

• Move from Y in direction A to Y{\) using the geodesic formula 

Y{t) = cos(i]i)y^ + ;7sin(s<)y^ 

where UYjV^ is the compact singular value decomposition of A (meaning U is n-hy-p and both 
E and V are p-hy-p) . 

• Repeat. 



The special case of minimizing F{Y) = i tr Y^AY {A n-hy-n symmetric) gives the geometrically 
correct Newton method for the symmetric eigenvalue problem. In this case Fy — AY and Fyy{A) = 
(/ — YY'^)AA. The resulting algorithm requires the solution of a Sylvester equation. It is the idealized 
algorithm whose approximations include various forms of Rayleigh quotient iteration, inverse iteration, 
a number of Newton style methods for invariant subspace computation, and the many variations of 



Davidson's eigenvalue method. These ideas are discussed in §^Tj and 
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3.2. Newton's Method on the Stiefel Manifold. Newton's method on the Stiefel manifold is 



conceptually equivalent to the Grassmann manifold case. Let Y be an n-by-p matrix such that Y^Y = /. 



pi 



and let F{Y) be a smooth function of Y , without the homogeneity condition imposed for the Grassmann 
manifold case. Compute formulas for Fy and Fyy{^) using the definitions given in §2.4.5, Newton's 
method for minimizing F{Y) on the Stiefel manifold is: 



Newton's Method for Minimizing F{Y) on the Stiefel Manifold 



• Given Y such that F^F = /p, 

o Compute Fy- YF^Y. 

o Compute A = — Hess~^ G such that Y^A — skew-symmetric and 

Fyy{A) -Yskew{F^A) ~ skew{AFf)Y - inAF^Fy = -G, 

where skew(X) ^ {X - X^)/2 and H = / - YY^ . 

• Move from Y in direction A to 1^(1) using the geodesic formula 

Y{t) = YM{t) + QN{t) 

where QR is the compact QR decomposition of (/ — YY'^)A (meaning Q is n-hy-p and R 
is p-by-p), A = Y'^A, and M{t) and N{t) are p-hy-p matrices given by the 2p-by-2p matrix 
exponential 



• Repeat. 



For the special case of minimizing F{Y) ~ ^ ivY'^AYN [A n-hy-n symmetric, N p-hy-p symmetric) 
Fy = AYN and Fyy{A) = AAN - YNA'^AY. Note that if N is not a multiple of the identity, 
then F{Y) does not have the homogeneity condition required for a problem on the Grassmann manifold. 
If = diag(p,p — 1, . . . , 1), then the optimum solution to maximizing F over the Stiefel manifold yields 
the eigenvectors corresponding to the p largest eigenvalues. 

For the orthogonal Procrustes problem |32|, F{Y) — \\\AY — [A m-hy-n, B m-hy-p, both 

arbitrary), Fy = A^AY - A^ B and Fyy(A) = A'^AA - YA'^A^AY. Note that r^Fyy(A) = skew- 
symmetric. 
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3.3. Conjugate Gradient Method on the Grassmann Manifold. Conjugate gradient tech- 
niques are considered because they are easy to implement, have low storage requirements, and provide 
superlinear convergence in the limit. The Newton equations may be solved with finitely many steps of 
linear conjugate gradient; each nonlinear conjugate gradient step, then, approximates a Newton step. In 
flat space, the nonlinear conjugate gradient method performs a line search by following a direction deter- 
mined by conjugacy with respect to the Hessian. On Riemannian manifolds, conjugate gradient performs 
minimization along geodesies with search directions defined using the Hessian described above ||7^, [76[ . 
Both algorithms approximate Hessian conjugacy with a subtle formula involving only the gradient di- 
rections, resulting in an algorithm that captures second derivative information by computing only first 
derivatives. To "communicate" information from one iteration to the next, tangent vectors must parallel 
transport along geodesies. Conceptually, this is necessary because, unlike flat space, the definition of 
tangent vectors changes from point to point. 



Using these ideas and formulas developed in §3.1, the conjugate gradient method on the Grassmann 
manifold is: 



Conjugate Gradient for Minimizing F{Y) on the Grassmann Manifold 

• Given Yq such that Y^Yq = /, compute Go = Fyg — YoY^Fyg and set Hq = — Gq. 

• For fc = 0, 1, . . ., 

o Minimize F{Yk{t)) over t where 

Y{t) = YV cos{Y.t)V'^ + Usm{Y,t)V'^ 

and UJ^V'^ is the compact singular value decomposition of H^- 
o Set tk ^ tmin and Yk+i = Yk{tk). 
o Compute Gfc+i = Fy^^^ - Yfe+i Y^^^Fy^^^ . 

o Parallel transport tangent vectors Hk and Gk to the point Yk+v 

(3.1) TiJfc = (-YfcFsinEtfc + C/cosStfc)!]^'^, 

(3.2) TGk =Gk~ {YkVsin^tk + U{I - cos^tk))U^Gk. 

o Compute the new search direction 

{Gk+i — TGk, Gk+i) 



Hk+i = -Gk+i + ikTHk where 7fc 

and (Ai, Aa) = tr A^fAj. 
o Reset Hk+i ~ —Gk+i if fc + 1 = mod p(n — p). 



{Gk,Gk) 



3.4. Conjugate Gradient Method on the Stiefel Manifold. As with Newton's method, con- 
jugate gradient on the two manifolds is very similar. One need only replace the definitions of tangent 
vectors, inner products, geodesies, gradients, and parallel translation. Geodesies, gradients, and inner 
products on the Stiefel manifold are given in §2.4. For parallel tr ansl ation along geodesies on the Stiefel 
manifold, we have no simple, general formula comparable to Eq. (3.2). Fortunately, a geodesic's tangent 
direction is parallel, so a simple formula for rHk comparable to Eq. ( |3.l|) is available, but a formula for 
rGfc is not. In practice, we recommend setting rGk ■= Gk and ignoring the fact that rGk will not be 
tangent at the point Yk+i- Alternatively, setting rGk ■— (also not parallel) results in a Fletcher- Reeves 
conjugate gradient formulation. As discussed in the next section, either approximation does not affect 
the superlinear convergence property of the conjugate gradient method. 

The conjugate gradient method on the Stiefel manifold is: 
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Conjugate Gradient for Minimizing F{Y) on the Stiefel Manifold 

• Given Yo such that Y^Yq = I, compute Go — Fyg — FqF^Fo and set Hq = —Gq. 

• For fc = 0, 1, . . . , 

o Minimize F{Yk{t)) over t where 

Yk{t) =YkM{t) + QNit), 

QR is the compact QR decomposition of [I — YkY^)Hk^ A = Y^ H^, and M{t) and N{t) are 
p-hy-p matrices given by the 2p-hy-2p matrix exponential appearing in Newton's method 



on the Stiefel manifold in 53.2 



o Set tk = tmin and Yk+i = Yk{tk). 

o Compute Gfe+i = Fyi^^^ - Yk+iF^^^^Yk+i. 

o Parallel transport tangent vector to the point Yfc+i: 

(3.3) THk = HkM{tk) - YkR^N{tk). 

As discussed above, set rGk ■— Gk or 0, which is not parallel, 
o Compute the new search direction 

Tj r< , u V, {Gk+i - TGk,Gk+i) 

Hk+1 = -Gfc+i + jkTHk where 7fe = - 



(Gfc, Gfc) 

and (Ai,A2) =trAf(/-iry^)A2. 
o Reset Hk+i = —Gk+i if + 1 = mod p{n — p) + p[p — l)/2. 



Orthogonality Constraints 



23 



3.5. Numerical Results and Asymptotic Behavior. 



3.5.1. Trace Maximization on the Grassmann Manifold. The convergence properties of the 
conjugate gradient a nd N ewton's methods applied to the trace maximization problem F{Y) — tr Y^AY 
are shown in Figure 3.3, as well as the convergence of an approximate conjugate gradient method and 
the Rayleigh quotient iteration for comparison. This example shows trace maximization on Gs^a, i.e., 
3-dimensional subspaces in 5 dimensions. The distance between the subspace and the known optimum 
subspace is plotted versus the iteration number, where the distance in radians is simply the square root 
of the sum of squares of the principal angles between the subspaces. The dimension of this space equals 
3(5 — 3) = 6; therefore, a conjugate gradient algorithm with resets should at least double in accuracy 
every six iterations. Newton's method, which is cubically convergent for this example (this point is 
discussed in |4.l| ), should triple in accuracy every iteration. Variable precision numerical software is used 
to demonstrate the asymptotic convergence properties of these algorithms. 
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Fig. 3.3. Convergence of the conjugate gradient and Newton's method for trace maximization on the Grassmann 
m.ani fnld. Ci ^^ t. The error (in radians) is the arc length distance between the solution and the subspace at the ith iterate 
[Eq. (2.6',) and \4.Sl. Quadratic convergence of CG is evident, as is cubic convergence of Newton's method, which is a 
special property of this example. 



The thick black curve (CG-1) shows the convergence of the conjugate gradient (CG) algorithm using 
the Polak-Ribiere formula. The accuracy of this algorithm is at least doubled between the first and sixth 
and the seventh and twelfth iterations, demonstrating this method's superlinear convergence. Newton's 
method is applied to the twelfth CG iterate, which results in a tripling of the accuracy and demonstrates 
cubic convergence of Newton's method, shown by the dashed thick black curve (NT-1). 

The thin black curve (CG-2) shows CG convergence using the Fletcher-Reeves formula 



(3.4) 



Ik — {Gk+i,Gk+i)/ {Gk,Gk)- 



As discussed below, this formula differs from the Polak-Ribiere formula by second order and higher 
terms, so it must also have superlinear convergence. The accuracy of this algorithm more than doubles 
between the first and sixth, seventh and twelfth, and thirteenth and eighteenth iterations, demonstrating 
this fact. 
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The algorithms discussed above are actually performed on the constraint surface, but extrinsic approx- 
imations to these algorithms are certainly possible. By perturbation analysis of the metric given below, it 
can be shown that the CG method differs from its flat space counterpart only by cubic and higher terms 
close to the solution; therefore, a flat space CG method modified by projecting search directions to the 
constraint's tangent space will converge super linearly. This is basically the method proposed by Bradbury 
and Fletcher [|j and others for the single eigenvector case. For the Grassmann (invariant subspace) case, 
we have performed line searches of the function (/)(t) = ti Q{t)^AQ{t), where Q{t)R{t) := Y + tA is the 
compact QR decomposition, and y-^A = 0. The QR decomposition projects the solution back to the 
constraint surface at every iteration. Tangency of the search direction at the new point is imposed via 
the projection / — YY^ . 

The thick gray curve (CG-3) illustrates the superlinear convergence of this method when the Polak- 
Ribiere formula is used. The Fletcher- Reeves formula yields similar results. In contrast, the thin gray curve 
(CG-4) shows convergence when conjugacy through the matrix A is used, i.e., -fk — —{H'[AGk+i)/ {H'^AHk),^ 
which has been proposed by several authors Eq. (5)], jl^, Eq. (32)], Eq. (20)]. This method can- 
not be expected to converge superlincarly because the matrix A is in fact quite different from the true 



Hessian on the constraint surface. This issue is discussed further in §4.4 




ITERATION 



Fig. 3.4. Convergence of the conjugate gradient and Newton's method for the orthogonal Procrustes problem on 



the Stiefel manifold Vs^s. The error is the Frobenius norm between the ith iterate and the known solu 
convergence of the CG and Newton methods is evident. The Newton iterates correspond to those of Table 



on . Quadratic 



To compare the performance of Newton's method to the Rayleigh quotient iteration (RQI), which 
approximates Newton's method to high order (or vice versa), RQI is applied to the approximate CG 
method's twelfth iterate, shown by the dashed thick gray curve (NT-2). 

3.5.2. Orthogonal Procrustes Problem on the Stiefel Manifold. The orthogonal Procrustes 
problem ||3^ 

(3.5) min H^F — A, _B given matrices. 



is a minimization problem defined on the Stiefel manifold that has no known analytical solution for 
p different from 1 or n. To ensure that the objective function is smooth at optimum points, we shall 
consider the equivalent problem 

(3.6) ^mm l\\AY - B\\l. 
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-0.44239067350975 
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0.38145454055699 

-0.42204935150912 
0.92362255783368 

-1.52598136000449 



0.00742708895676 
0.14195063498923 
1.75394028742695 
0.62648865033822 
0.89515519875598 



-0.09653196026400 > 
-0.16309797180034 
-0.63865179066515 
-0.31561702752866 
0.87362106204727/ 



Table 3.1 

Newton's method applied to the orthogonal Procrustes problem on the Stiefel manifold using the MATLAB code given 
in this section. The matrix A is given below the numerical results, and B = A/s^a. The quadratic convergence of Newton's 
method, shown by the Frobenius norm of the difference between Yi and Y = /s^s, is evident. This convergence is illustrated 
in Figure 3.4. It is clear from this example that the difference Yi — Y approaches a tangent vector at Y = In,p, i.e., 
Y"^ (Yi — Y) ^ skew-symmetric. 



Derivatives of this function appear at the end of 
to this problem appears below 



B is illustrated in Table 3.1 and Figure 
conjugate gradient algorithm is evident 



p.2| . MATLAB code for Newton's method applied 
Convergence of this algorithm for the case V^^^ and test matrices A and 
The quadratic convergence of Newton's method and the 
The dimension of ¥5^3 equals 3(3 — l)/2 -|- 6 = 9; therefore, the 



accuracy of the conjugate gradient should double every nine iterations, as it is seen to do in Figure p. 4 . 
Note that the matrix B is chosen such that a trivial solution Y = In,p to this test optimization problem 
is known. 



MATLAB Code for Procrustes Problem on the Stiefel Manifold 

n = 5; p = 3; 

A = randn(n) ; 
B = A+eye(n,p) ; 

YO = eyeCn,p); % Knoxm solution YD 

H = 0.1*randn(n,p) ; H = H - YO*(H'*YO); '/. small tangent vector H at YO 
Y = stiefgeod(Y0,H) ; % Initial guess Y (close to know solution YO) 

°/g Newton iteration (demonstrate quadratic convergence) 
d = norm(Y-YO, 'fro') 
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while d > sqrtCeps) 

Y = stiefgeod(Y,procrnt(Y,A,B)) ; 
d = norm (Y-YO, 'fro') 

end 



function stiefgeod 



function [Yt,Ht] = stiefgeod(Y,H,t) 
•/.STIEFGEOD Geodesic on the Stiefel manifold. 



ST1EFGE0D(Y,H) is the geodesic on the Stiefel manifold 
emanating from Y in direction H, where Y'*Y = eye(p), Y'*H = 
skew-hermitian, and Y and H are n-by-p matrices. 

STIEFGEOD (Y,H,t) produces the geodesic step in direction H scaled 
by t. [Yt.Ht] = STIEFGEOD(Y,H,t) produces the geodesic step and the 
geodesic direction. 



[n,p] = sizeCY) ; 

if nargin < 3 , t = 1 ; end 
A = Y'*H; A = (A - A')/2; 
[C1,R] = qrCH - Y*A,0) ; 
MN = expmCt* [A,-R' ;R,zeros(p)] ) 



°/o Ensure skew-symmetry 
MK = MN(:,l:p); 



Yt = Y*MN(l:p,:) + Q*MN(p+l : 2*p, : ) ; Z Geodesic from Eq. (2.4$ 
if nargout > 1 , Ht = H*MN(l:p,:) - Y* CR' *MN(p+l : 2»p, : ) ) ; end 



°/ Geodesic direction from Eq. C3.3) 



function procrnt 



function H = procrnt CY, A, B) 

'/.PROCRNT Newton Step on Stiefel Manifold for l/2*norm(A*Y-B , 'f ro ' ) "2 . 

•/. H = PROCRNT(Y,A,B) computes the Newton step on the Stiefel manifold 

'/. for the function l/2*norm(A*Y-B , ' f ro ' ) "2 , where Y'*Y = eye(si2e(Y,2)) . 

[n,p] = size CY) ; 

AA = A'*A; FY = AA*Y - A'*B; YFY = Y'*FY; G = FY - Y*YFY'; 

7. Linear conjugate gradient to solve a Newton step 

dimV = p*(p-l)/2 + p*(n-p); 7. == dim Stiefel manifold 

7. This linear CG code is modified directly from Golub and Van Loan [ ^sj ] 
H = zeros (size (Y) ) ; Rl = -G; P = Rl ; PO = zeros (size (Y) ) ; 
for k=l:dimV 

normRl = sqrt (stief ip(Y,Rl ,R1) ) ; 

if normRl < prodCsize (Y) ) *eps , break; end 

if k == 1, beta = 0; else, beta = (normRl/normRO) "2 ; end 

PO = P; P = Rl + beta*P; FYP = FY'*P; YP = Y'*P; 

LP = AA*P - Y*(P'*AA*Y) ... 7. Linear operation on P 

- Y*((FYP-FYP')/2) - (P*YFY'-FY*YP')/2 - (P-Y*YP) * (YFY/2) ; 

alpha = normRl"2/stief ip(Y,P,LP) ; H = H + alpha*P; 

RO = Rl; normRO = normRl; Rl = Rl - alpha*LP; 

end 



function stiefip 



function ip = stief ip(Y,A,B) 

7.STIEFIP Inner product (metric) for the Stiefel manifold. 

7. ip = STIEFIP (Y, A, B) returns trace (A' * (eye (n) -1/2*Y*Y' ) •B) , 

7. where Y'*Y = eye(p), Y'*A & Y'*B = skew-hermitian, and Y, A, 

7. and B are n-by-p matrices. 



ip = sum(sum(conj (A) .*(B - Y* ( (Y' *B) /2) ) ) ) ; 7. Canonical metric from Eq. (2.39) 
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3.6. Convergence Rates of Approximate Methods. The algorithms presented in the previous 
sections are idealized in that geometrically natural ideas such as geodesies and parallel translation are 
used in their definitions. However, approximations can yield quadratic rates of convergence. In the limit, 
the Riemannian algorithms approach their Euclidean counterparts in the tangent plane of the solution 
point. A perturbation analysis shows which terms are necessary and which terms are not necessary to 
achieve quadratic convergence. The following argument holds for any Riemannian manifold, and therefore 
applies to either the Grassmann or Stiefel manifold case. 

Consider the CG method applied to a function F{Y) starting at a point Y within distance e (small) 
of the solution Y. For a manifold of dimension d, we must perform a sequence of d steps that take us 
within distance O(e^) of the solution Y. The Riemannian CG method 

rj \ U (Gncw '^Goldi Gy^cw) 

^ncw — ^ncw r '^^-tj-old: T — ITT^ no ' 

l|Goidlr 

Yna^^YiUin), y(o) = yoid, r(o) = ffncw; 

does this, but we wish to approximate this procedure. Within a ball of size 0(e) around Y, these quantities 
have sizes of the following orders: 



Order Quantity 



0(1) ^min, 7 

0(e) G, H (new or old) 

0(e2) ||Gf , (newor old) 

0{e^) (rGoidiGnow) 



Also, by perturbation analysis of the Riemannian metric ||l8|, |7^, Vol. 2, Chap. 4, Props. 1 and 6], we have 

r(e) =y(0)+eA + O(e3) 
rG(e) = G + 0(e2) 
{■.)=I + 0{e') 

where ^(e) is a geodesic in direction A, TG{e) is parallel translation of G along ^(e), and the last 
approximation is valid for an orthonormal basis of the tangent plane at Y(eA) and / is the identity. 

Inserting these asymptotics into the formulas for the CG method show that near the solution, eliminat- 
ing the Riemannian terms gives O(e^) perturbations of the CG sequence, and therefore does not affect the 
quadratic rate of convergence. Furthermore, it can also be seen that eliminating the Polak-Ribiere term 
— (rGoid, Gnow)/||Goid|P, yielding the Fletcher-Reeves algorithm, perturbs the CG sequence by O(e^) 
terms, which does not affect the quadratic rate of convergence. Thus the approximate CG methods 



discussed in B.5.1 converge quadratically. 

4. Examples: Insights and AppHcations. In this section, we consider ideas from the literature as 
applications of the framework and methodology developed in this paper. It is our hope that some readers 
who may be familiar with the algorithms presented here, will feel that they now really see them with a new 
deeper, but ultimately clearer understanding. It is our further hope that developers of algorithms that 
may somehow seem new will actually find that they also already fit inside of our geometrical framework. 
Finally, we hope that readers will see that the many algorithms that have been proposed over the past 
several decades are not just vaguely connected to each other, but are elements of a deeper mathematical 



structure. The reader who sees the depth and simplicity of §4.10, say, has understood our message. 



4.1. Rayleigh Quotient Iteration. If yl is a symmetric matrix, it is well known that Rayleigh 
quotient iteration (RQI) is a cubically convergent algorithm. It is easy to derive formulas and show that 
it is true, here we will explain our view of why it is true. Let r{x) denote the Rayleigh quotient x^Ax 
and, abusing notation, let r{9) denote the Rayleigh quotient on a geodesic with 9 — corresponding to 
an eigenvector of A. 

Here is the intuition. Without writing down any formulas, it is obvious that r(9) is an even function 
of 9; hence = is an extreme point. Newton's optimization method, usually quadratically convergent, 
converges cubically on nondegenerate even functions. Keeping in mind that A — r{x)I is the second 
covariant derivative of the Rayleigh quotient, inverting it must amount to applying Newton's method. 
Following this intuition, RQI must converge cubically. The intution is that simple. 
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Fig. 4.1. Cubic convergence of RQI and Newton's method applied to Rayleigh's quotient. The vector is an eigenvector. 



Indeed, along a geodesic, r{6) — Acos^ 9 + asin^ 9 (we ignore the degenerate case a = A). The fcth 
step of Newton's method for the univariate function r{9) is readily verified to be 

Ok+i =0k-\ tan(20,O = -^9l + 0{9l). 

We think of updating 9 as moving along the circle. If we actually moved tangent to the circle by the Newton 
update — itan(20fe) and then projected to the circle, we would have the Rayleigh quotient iteration 

9k+i =9k~ arctan(i tan(20fe)) - -9l + 0{9l). 

This is the mechanism that underlies Rayleigh quotient iteration. It "thinks" Newton along the geodesic, 
but moves along the tangent. The angle from the eigenvector goes from 9 to —9^ almost always. (Readers 
comparing with Parlett p5| , Eq. (4-7-3)] will note that only positive angles are allowed in his formulation.) 

When discussing the mechanism, we only needed one variable: 9. This is how the mechanism should 
be viewed because it is independent of the matrix, eigenvalues, and eigenvectors. The algorithm, however, 
takes place in x space. Since A — r{x)I is the second covariant derivative of r{x) in the tangent space at x, 
the Newton update 5 is obtained by solving Ii{A — r{x)I)5 = —HAx = —{A — r{x)I)x, where 11 — I — xx^ 
is the projector. The solution is (5 = —x + y/{x'^y), where y = {A — r{x)I)~^x. The Newton step along 
the tangent direction is then x ^ x + 6 — y/{x'^y), which we project to the unit sphere. This is exactly 



an RQI step. These ideas are illustrated in Figure 4.1 



One subtlety remains. The geodesic in the previous paragraph is determined by x and the gradient 
rather than x and the eigenvector. The two geodesies converge to each other by the inverse iteration 
process (almost always) allowing the underlying mechanism to drive the algorithm. 

One trivial example where these issues arise is the generalization and derivation of Davidson's method 
[ [74| [2^ , |2^ . In this context there is some question as to the interpretation of D — A/ as a preconditioner. 
One interpretation is that it preconditions the eigenproblem by creating better eigenvalue spacings. We 
believe that there is a more appropriate point of view. In linear conjugate gradient for Ax = &, precon- 
ditioners are used to invert M which is an approximation to A (the Hessian of ^x'^Ax — x'^b) against 
the gradient. This is an approximate Newton step. In nonlinear conjugate gradient, there is no consen- 
sus as to whether inverting the Hessian (which is approximated hy D — A/!) would constitute the ideal 
preconditioner, but it is a Newton step. Therefore with the link between nonlinear conjugate gradient 
preconditioning and approximate Newton step, we see that Davidson's method is deserving of being called 
a preconditioner from the conjugate gradient point of view. 
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4.2. Coordinate Singularities of Symmetric Matrices. An important open problem in numer- 
ical linear algebra is the complete understanding of the influence of singularities on computations ^ . 
In this section we shall describe the singularity associated with multiple eigenvalues of symmetric matrices 



in terms of coordinate singularities, i.e., the breakdown of the coordinate representation. In §4.10, we will 
describe how understanding this coordinate singularity underlies a regularization approach to eigenvalue 
optimization. 

Matrix factorizations are nothing more than changes in variables or coordinate changes. In the plane, 
Cartesian and polar coordinates both give orthogonal systems, but polar coordinates have a coordinate 
singularity at the origin. A small perturbation near the origin can violently change the angle coordinate. 
This is ill-conditioning. If the r coordinate goes through the origin we have a singularity of the form \r\. 

Consider traceless, symmetric, 2-by-2 matrices: 

y -X 



The positive eigenvalue is r = ■\Jx^ + y^, and one of the orthogonal eigenvectors is J^), where tan 9 = 

y/x. The conversion between matrix elements and the eigendecomposition is exactly the conversion from 
Cartesian to polar coordinates. Whatever ill-conditioning one associates with a symmetric matrix with 
two close eigenvalues, it is the same numerical difficulty associated with the origin in polar coordinates. 
The larger eigenvalue behaves like |r| at the origin, and the eigenvector behaves like 9 changing violently 
when perturbed. If one wants to think about all 2-by-2 symmetric matrices, add z as the trace, and the 
resulting interpretation is cylindrical coordinates. 

We now generalize. Let Sn be the space of n-by-n symmetric matrices. Suppose that the largest 
p eigenvalues Ai, . . . , Ap coalesce. The corresponding eigenvectors are not uniquely determined, but the 
invariant subspace is. Convenient parameterizations are 

Sn = Symmetric Matrices — x Vn^p x Sn-p 
Sn,p = { Sn ■ ^1 has multiplicity p} ~ R x Gn,p x Sn-p 

That is, any symmetric matrix may be parameterized by its p largest eigenvalues, the corresponding 
eigenvectors in order, and the (n — p)-hy-{n — p) symmetric operator on the space orthogonal to these 
eigenvectors. To parameterize a symmetric matrix with eigenvalue A of multiplicity p, we must specify 
the invariant subspace corresponding to this eigenvalue and, once again, the {n — p)-hy-{n — p) symmetric 
operator on the orthogonal subspace. It is worth mentioning that the parameters in these decompositions 
give an orthonormal system (surfaces with constant parameters intersect orthogonally). The codimension 
of Sn,p in Sn is p{p + l)/2 — 1, obtained by adding p — I (corresponding to A2, . . . , Ap) to p{p — l)/2 (the 
codimension of Gn,p in V^,p). 

Another interpretation of the well-known fact that when eigenvalues coalesce, eigenvectors, but not 
invariant subspaces, are ill-conditioned, is that the Stiefel manifold collapses to the Grassmann manifold. 
As with polar coordinates we have a coordinate singularity corresponding to ill-conditioning near Sn,p- 
Near this set, a small perturbation will violently move the Stiefel component. The singularity associated 
with the coallescing of eigenvalues is very much the singularity of the function f{x) = \x\. 

4.3. The CS Decomposition. The CS decomposition |^ should be recognized as the geodesic 
between two points on the Grassmann manifold. Any n-by-n orthogonal matrix Q may be written as 



(4.1) Q = 



c 


-s 





s 


c 











/ 



f/ / „ r. , \ U 



for some p-hy-p orthogonal matrices V and V and {n — p)-hy-{n — p) orthogonal matrices U and U, and 
p angles 9i where C = diag (cos^i, . . . , cos 9p) and S = diag(sin0i, . . . ,sin0p). Comparing this with the 



geodesic formula Eq. ( 2.65 ) and letting 9i = tui ii = 1, . . . , p) where Ui are the diagonal elements of S, 
we see that the first p columns of the CS decomposition traverse a geodesic emanating from F(0) = (q) 
(the origin). The next p columns give an orthogonal basis for the velocity vector along the geodesic (in 
fact, they are the orthogonal component of its polar decomposition). 
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As is well known, the 6i are the principal angles between the subspaces spanned by the first p columns 
of Q and the origin. In general, let 6'^ (i = 1, ■ ■ ■ , p) be the principal angles between the two subspaces 
spanned by the columns of n-hy-p orthonormal matrices Yi and I2, i-e., C/(cos 9)T^"^ is the singular value 
decomposition of ^1^12, where Q is the diagonal matrix of principal angles. Also let and sin6' represent 
the p- vectors formed by the 9i and sin^^. These principal angles provide several different definitions of 
the distance between two subspaces: 

1. arc length: d{Yi,Y2) = \\9\\2 

2. Fubini-Study: d-Fs{Yi, Y2) — arccos | det Y]^l2| = arccos(J|j cosOi) 

3. chordal 2-norm: dc2(yL,>2) = \\YiU - ^2^112 = ||2sini0||oo 

4. chordal Frobenius-norm: dcpiYi^Y^) = \\YiU ~ Y2V\\f = ||2 sin 16*112 

5. projection 2-norm ||: dp2{Yi,Y2) = \\Y^Y^ - 1^2^/112 = || sin0|U 

6. projection F-norm: d.pF{Yi.Y2) = 2~^/^\\Y{Yl - Y2Y:[\\f = || sin6i||2 

The arc length distance is derived from the intrinsic geometry of the Grassmann manifold. The 
chordal 2-norm and Frobenius-norm distances are derived by embedding the Grassmann manifold in the 
vector space R"^p, then using the the 2- and Frobenius- norms, respectively, in these spaces. Note that 
these distances may be obtained from the minimization problems 

dc2 or cf{Yi,Y2) = miu \\YiQi - Y2Q2\\2 or _F- 

The projection matrix 2-norm and Frobenius-norm distances are derived by embedding the Grassmann 
manifold in the set of n-by-n projection matrices of rankp, then using the 2- and Frobenius-norms, respec- 
tively. The Fubini-Study distance is derived via the Pliicker embedding of Gn,p into the projective space 
P(/\^(R")) (by taking wedge products between all columns of Y), then using the Fubini-Study metric 
[p4|.|^ Note that all metrics except the chordal and projection matrix 2-norm distances are asymptotically 
equivalent for small principal angles, i.e., these embeddings are isometrics, and that for Yi ^ Y2 we have 
the strict inequalities 

d{Yi,Y2) > d^s{Yi,Y2), (4.2) 
d{Yi,Y2) > dcAYi,Y2) > dpF{Yi,Y2), (4.3) 
d{Yi,Y2) > dcF{YuY2) > dc2{Yi,Y2), (4.4) 
d{Yi,Y2) > dpF{Yi,Y2) > dp2{Yi,Y2). (4.5) 

These inequalities are intuitively appealing because by embedding the Grassmann manifold in a higher 
dimensional space, we may "cut corners" in measuring the distance between any two points. 

4.4. Conjugate Gradient for the Eigenvalue Problem. Conjugate gradient algorithms to min- 
imize ^y'^Ay [A symmetric) on the sphere {p — 1) is easy and has been proposed in many sources. The 
correct model algorithm for p > 1 presented in this paper is new. We were at first bewildered by the 
number of variations [|, |, ||, ||, |, ||, ^ ||, ||, |^, || ||, most of which propose 



algorithms for conjugate gradient for the eigenvalue problem. Most of these algorithms are for computing 
extreme eigenvalues and corresponding eigenvectors. It is important to note that none of these methods 
are equivalent to Lanczos It seems that the correct approach to the conjugate gradient algorithm 
for invariant subspaces {p > I) has been more elusive. We are only aware of three papers ^ that 

directly consider conjugate gradient style algorithms for invariant subspaces of dimension p > 1. None of 
the proposed algorithms are quite as close to the new idealized algorithms as the p = 1 algorithms are. 
Each is missing important features which are best understood in the framework that we have developed. 
We discuss these algorithms below. 

The simplest non-trivial objective function on the Grassmann manifold G„^p is the quadratic form 

F{Y) = \rY^AY, 

where A is a symmetric n-by-n matrix. It is well known that the solution to the minimization of F is the 
sum of the p smallest eigenvalues of A, with an optimal Y providing a basis for the invariant subspace 
corresponding to the p smallest eigenvalues. 



We thank Keith Forsythe for reminding us of this distance. 
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To solve the eigenvalue problem, one may use the template directly from §3.3 after deriving the 
gradient 



VF{Y) = AY - Y{Y'^AY) 



and the second covariant derivative of F{Y): 



Hessi^(Ai, A2) = tr(Af^A2 - (Af As)^^^^) 



The line minimization problem may be solved as p separate two-by-two problems in parallel, or it may 
be solved more completely by solving the 2p-by-2p eigenvalue problem. This does not follow the geodesic 
directly, but captures the main idea of the block Lanczos algorithm which in some sense is optimal |2^, ^ . 

If one is really considering the pure linear symmetric eigenvalue problem then pure conjugate gradient 
style procedures must be inferior to Lanczos. Every step of all proposed non-preconditioned conjugate 
gradient algorithms builds vectors inside the same Krylov space in which Lanczos gives an optimal solution. 
However, exploring conjugate gradient is worthwhile. When the eigenvalue problem is non-linear or the 
matrix changes with time, the Lanczos procedure is problematic because it stubbornly remembers past 
information that perhaps it would do well to forget. (Linear conjugate gradient, by contrast, benefits from 
the memory of this past information.) Applications towards non- linear eigenvalue problems or problems 
that change in time drive us to consider the conjugate gradient method. Even the eigenvalue problem 
still plays a worthy role: it is the ideal model problem that allows us to understand the procedure much 
the way the Poisson equation on the grid is the model problem for many linear equation solvers. 

Conjugate gradient on the sphere [p — 1) computes the smallest eigenvalue of a symmetric matrix 
A. Two papers |]67| , [l9|] consider imposing conjugacy through A. This is an unfortunate choice by itself 
because A is quite different from the Hessian A — r{x)I, where r(a;)_is the Rayleigh quotient. A few authors 
directly consider conjugacy through the unconstrained Hessian [ pO] , Others attempt to approximate 
conjugacy through the Hessian by using Polak-Ribiere or Fletcher-Reeves [|[ |3|, ||, |||, |3|, ^, ||, ||. 
It is quite possible that most of these variations might well be competitive with each other and also our 
idealized algorithm, but we have not performed the numerical experiments because ultimately the p = 1 
case is so trivial. A comparison that may be of more interest is the comparison with restarted Lanczos. We 
performed an informal numerical experiment that showed that the conjugate gradient method is always 
superior to two step Lanczos with restarts (as it should be since this is equivalent to the steepest descent 
method), but is typically slightly slower than four step Lanczos. Further experimentation may be needed 
in practice. 

Turning to the p > I case, the three papers that we are aware of are |, |6). The algorithm 
proposed in Alsen 1^, has a built-in extra feature not in the idealized algorithm. Though this may not 
be obvious, it has one step of orthogonal iteration built in. This may be viewed as a preconditioning 
procedure giving the algorithm an advantage. The Sameh-Wisniewski [fzol algorithm begins with many 
of the ideas of an idealized Grassmann algorithm, including the recognition of the correct tangent on the 
Grassmann manifold (though they only mention imposing the Stiefel constraint). Informal experiments 
did not reveal this algorithm to be competitive, but further experimentation might be appropriate. The 
more recent Fu and Dowling algorithm imposes conjugacy through A and therefore we do not expect 
it to be competitive. 
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4.5. Conjugate Gradient for the Generalized Eigenvalue Problem. It is well known that 
the generalized eigenvalue problem Ax — XBx may also be posed as a constrained optimization problem. 
Now we must find 



subject to the constraint that 



min tr Y'^AY 
Y^BY = /„ 



With the change of variables 

(4.6) Y = B^I'^Y 

(4.7) A = S^/^A 

(4.8) A = B^^/^AB-^/^ 
the problem becomes 

mintrF^AF subject to F^F = /p. 

The numerical algorithm will be performed on the non-overlined variables, but the algorithm will be 
mathematically equivalent to one performed on the overlined variables. 

Notice that the condition on tangents in this new coordinate system is that 

A'^sr = 0. 

It is readily checked that the gradient of the trace minimization problem becomes 

G = {B-^ - YY^)AY 

(note that G^BY = 0). 

Geodesies may be followed in any direction A for which t^BY = by computing a compact variation 
on the SVD of A: 

A = [/Sy^, where U'^BV = I. 

For simplicity, let us assume that A has full rank p. The V vectors are the eigenvectors of the matrix 
A-^i?A, while the U vectors are the eigenvectors of the matrix AA^ B corresponding to the non-zero 
eigenvalues. There is also a version involving the two matrices 

A^ 

I and 

T 



A' 

This SVD may be expressed in terms of the quotient SVD | 
Given the SVD, we may follow geodesies by computing 









B 


A^ 











A 






Y{t)^{YV U)(^^^^V^. 



All the Y along this curve have the property that Y^BY — I. For the problem of minimizing ^ trY^AY, 
line minimization decouples into p two-by-two problems just as in the ordinary eigenvalue problem. 
Parallel transport, conjugacy, and the second covariant derivative may all be readily worked out. 

4.6. Electronic Structures Computations. In this section, we briefly survey a research area 
where conjugate gradient minimization of non-quadratic but smooth functions on the Stiefel and Grass- 
mann manifolds arise, the ab initio calculation of electronic structure within the local density approxi- 
mation. Such approaches use only the charge and mass of electrons and atomic nuclei as input and have 
greatly furthered understanding of the thermodynamic properties of bulk materials |l2] , the structure and 



dynamics of surfaces |5l|, |6l[ , the nature of point defects in crystals 1 60 , and the diffusion and interaction 
of impurities in bulk materials jsj]. Less than ten years ago, Car and Parrinello in a watershed paper 
proposed minimization through simulated annealing. Teter and Gillan |^ , |8^ later introduced conjugate 
gradient based schemes and demonstrated an order of magnitude increase in the convergence rate. These 
initial approaches, however, ignored entirely the effects of curvature on the choice of conjugate search 
directions. Taking the curvature into partial account using a generalization of the Riemannian projection 
led to a further improvement in computation times by over a factor of three under certain conditions . 
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Our ability to compute ab initio, using only the charge and mass of electrons and atomic nuclei 
as input, the behavior of systems of everyday matter has advanced greatly in recent years. However, 
the computational demands of the approach and the attendant bounds on the size of systems which 
may be studied (several hundred atoms) have limited the direct impact of the approach on materials and 
chemical engineering. Several ab initio applications which will benefit technology tremendously remain out 
of reach, requiring an order of magnitude increase in the size of addressable systems. Problems requiring 
the simultaneous study of thousands of atoms include defects in glasses (fiber optics communications), 
complexes of extended crystalline defects (materials' strength and processing), and large molecules (drug 
design) . 

The theoretical problem of interest is to find the smallest eigenvalue Eq of the Schrodinger equation 
in the space of 3iV dimensional skew-symmetric functions. 



where the Hamiltonian operator H is defined by 



E (-l< + y^on{rr.))+\ E ||2 - 



Here, iV is the number of electrons in the system under study, now typically on the order of several 
hundred, is the position of the ith electron, Vionif) is the potential function due to the nuclei and 
inner electrons, and the second summation is recognized as the usual Coulomb interactions. Directly 
discretizing this equation at M grid-points in space would lead to absurdly huge eigenvalue problems 
where the matrix would be M^-by-M^. This is not just a question of dense versus sparse methods, a 
direct approach is simply infcasible. 

The fundamental theorems which make the ab initio approach tractable come from the density func- 
tional theory of Hohenberg and Kohn and Kohn and Sham . Density functional theory states that 
the ground states energy of a quantum mechanical system of interacting electrons and ions is equal to the 
solution of the problem of minimizing an energy function over all possible sets of N three-dimensional 
functions (electronic orbitals) obeying the constraints of orthonormality. Practical calculations generally 
use a finite basis to expand the orbitals, but for purposes of discussion, we may discretize the problem 
onto a finite spatial grid consisting of M points. The Kohn-Sham minimization then becomes. 



(4.9) Eq = min E(X) 

XTX=In 



min tiiX'^HX) + f{p{X)), 



where each column of X is a different electronic orbital sampled on the spatial grid, p is the vector 
Pi{X) = H is an M-by-M matrix (single-particle Hamiltonian), and / is a function which we 

leave unspecified in this discussion. In full generality the X are complex, but the real case applies for 
physical systems of large extent that we envisage for this application and we, accordingly, take X to 
be real in this discussion. 

Recent advances in computers have enabled such calculations on systems with several hundreds of 
atoms ^, 11 . Further improvements in memory and performance will soon make feasible computations 
with upwards of a thousand atoms. However, with growing interest in calculations involving larger 
systems has come the awaren ess t hat as the physical length of systems under study increases, the Hessian 
about the minimum of Eq. ( |4.9| ) becomes increasingly ill-conditioned and non-conjugate minimization 
approaches exhibit a critical slowing down |8^ . This observation prompted workers [^2| Q to apply 
conjugate gradient concepts to the problem, and now dozens of researchers have written papers using 
some form of the conjugate gradient method. In particular, one has a Grassmann problem when the 
number of electrons in each state is constant (i.e., two one spin up and one spin down). This is what 
happens in calculations on semiconductors and "closed shell" atoms and molecules. Otherwise, one has a 
Stiefel problem such as when one has metals or molecules with partially filled degenerate states. 
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The framework laid out in this discussion may be of practical use to the ab initio density-functional 
community when the inner product computation through the Hessian of E{X) is no more computation- 
ally complex to evaluate than calculating the energy function E{X) or maintaining the orthonormality 
constraints X^X — 1^. A suitable form for this inner product computation is 



(4.10) 



tT{x'^{H + V){XY'^Z)) 



where V is the diagonal matrix defined by Vij = {d f / d pi)6ij , Oi = YinXin, Ti = J2n ^inXin- Written 
this way, the first two terms of Eq. (4.10) have the same form and may be evaluated in the same manner 
as the corresponding terms in Eq. (4.9), with a and r playing roles similar to p. T he third term, coming 
from the curvature, may be evaluated in the same way as the first term of Eq. ( 4.10 ) once given the object 
XY^ Z, which is no more computationally complex to obtain than the Gram-Schmidt orthonormalization 
of an object like X. 

4.7. Subspace Tracking. The problem of computing the principal invariant subspace of a symmet- 
ric or Hermitian matrix arises frequently in signal processing applications, such as adaptive filtering and 
direction finding ^4, 72, ^, ^ . Frequently, there is some time- varying aspect to the signal processing 
problem, and a family of time-varying principal invariant subspaces must be tracked. The variations may 
be due to either the addition of new information as in covariance matrix updating, a changing signal 
environment, or both. For example, compute the principal invariant subspace of either of the covariance 
matrices 



(4.11) 
(4.12) 



Rk = Rk-i + Xkxl A: = 1, 2, 
R{t) = a continuous function of t 



and Xk is given 



at every iteration or at discrete times. Eq. (4.11) typically arises from updating the sample covariance 
matrix estimate; Eq. (4.12), the more general case, arises from a time-varying interference scenario, e.g., 
interference for airborne surveillance radar |8^, |7^ . Solving this eigenvalue problem via the eigenvalue or 
singular value decompositions requires a large computational effort. Furthermore, only the span of the first 
few principal eigenvectors may be required, whereas decomposition techniques compute all eigenvectors 
and eigenvalues, resulting in superfluous computations. Approaches to this problem may be classified as 
standard iterative methods |Q, methods exploiting rank 1 updates |5^, ^ |9l, |8^, |l^, |5^, i.e., 
Eq. (4.11), Lanczos based methods [20|^9l|, |9^, gradient based methods ^2[ ]lO|, conjugate gradient 
based methods [3l, |l^, |7^, |7^, [78 1, which are surveyed by Edelman and Smith pl[|, Raylcigh-Ritz 
based methods 37[ ^0| , and methods that exploit covariance matrix or array structure [|68|, pof . 

If the subspace does not change quickly over (discrete or continuous) time, then the desired solu- 
tion will be close to the previously computed solution, and an iterative gradient-based algorithm such 
as the conjugate gradient algorithm may be computationally attractive for the subspace tracking prob- 
lem. Thus the subspace tracking problem is treated as a time-varying optimization problem. Other 
conjugate gradient methods for computing principal invariant subspaces in a signal processing context 
have appeared jl^, |7l|, howe ver, these conjugate gradient techniques do not exploit the struc- 

ture of the subspace constraint (see §4^). Instead, we empl oy t he conjugate gradient method on the 
Grassmann manifold, or an approximation of it discussed in §3^. Comon and Golub |20| describe and 
compare a wide variety of different algorithms for the problem of exponential covariance matrix updates, 
with particular emphasis on Lanczos and gradient-based algorithms. Yang, Sarkar, and Arvas |^ sur- 
vey some conjugate gradient algorithms applied to computing the principal invariant subspace of a fixed 
symmetric matrix. We adopt the general assumption that the matrix may change arbitrarily over time, 
but that it must vary "slowly enough" so that using a conjugate gradient based approach is computa- 
tionally efficient. This last constraint is, of course, dependent upon the application. For the example of 



space-time adaptive processing for airborne radar with a rotating antenna. Smith [Tg] shows that this 
method is capable of tracking the principal invariant subspace of clutter interference; however, when 
the interference dimension p is increased to account for new interference eigenvalues, one does better to 
compute the eigendecomposition from scratch and use it to initiate a new subspace track. 
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4.8. Newton's Method for Invariant Subspace Computations. Methods for refining estimates 
for invariant subspace computations have been proposed by Chatehn n3, HM, Dongarra, Moler, and 



Wilkinson [g9|, and Stewart |80 . Demmel |22, §3] proposes a unified approach by showing that they are 
all solutions to a Riccati equation. 

These algorithms, when applied to symmetric matrices, are all variations on our geometrical Newton 
algorithm and may be understood in this context. There is nothing special about the eigenvalue problem; 
Newton's method for any function on the Grassmann manifold yields a Sylvester equation in the tangent 
space. The reason a Riccati equation arises rather than a Sylvester equation is that the previous algorithms 
formulate the problem in an affine space with arbitrary constraints. Previous researchers knew the 
quadratic term in the Riccati equation belonged there, and knew that it somehow is related to the 
orthogonality constraints, but we now see that it is an artifact of a flat space derivation. 

Let us take a closer look. Previous researchers proposed algorithms for invariant subspaces by asking 
for a solution to the matrix equation 

AY -YB = 
made nondegenerate by imposing the affine constraint 

z^y = /, 

for some arbitrary choice of Z. In the Dongarra et al. case, Z may be obtained by inverting and transposing 

an arbitrary pxp minor of the nxp matrix Y. In Moler's Matlab notation Z=zeros (n,p) ; Z(r , : )=inv(Y(r , :))',( 

where r denotes a p- vector of row indices. For Stewart, Z = Y(Y'^Y)-^. 

A mathematically insightful approach would require no arbitrary choice for Z. We would simply 
specify the problem by performing Newton's method on the function F(Y) = i tr Y^AY on the Grassmann 
manifold. The stationary points of F{Y) are the invariant subspaces. There is no need to specify any 
further constraints and there are no degeneracies. (Notice that asking for the solution to AY = YiY^AY) 
subject to Y^Y — I is a, degenerate problem.) 

Newton's method requires the solution A to the Sylvester equation 

n(AA - A(y^^y)) = -hay, 

where 11 = (/ — YY'^) denotes the projection onto the tangent space of the Grassmann manifold and 
G = HAY is the gradient. The solution is A = —Y + Z{Y'^Z)~^, where Z is the solution to the Sylvester 
equation AZ ~ Z{Y^AY) = Y. Y may be chosen so that Y'^AY is diagonal, yielding simultaneous 
Rayleigh quotient iterations. If we move along the tangent and project rather than the geodesic we have 
the iteration sending Y to the Q factor in the QR decomposition of Z. 

4.9. Reduced Gradient Methods, Sequential Quadratic Programming, and Lagrange 
MultipHers. In this section, we generalize beyond the Stiefel and Grassmann manifolds to show how 
the language and understanding of differential geometry provides insight into well-known algorithms for 
general non-linear constrained optimization. We will show the role that geodesies play in these algorithms. 
In the next subsection, we will then apply the geometrical intuition developed here to directly formulate 
regularized sequential quadratic programs as is needed in eigenvalue optimization. 

Here we study sequential quadratic programming (SQP) and reduced gradient methods (RGM). By 
SQP we mean the algorithm denoted as Newton SQP by Boggs and ToUe @, p. 14] , SQP by Nash and 
Sofer p. 512], and QP-based projected Lagrangian by Gill, Murray, and Wright p. 238, Eq. (6.41)]. 
By RGM, we specifically mean the method sometimes denoted as the reduced Hessian method 0, p. 25] , 
other times simply denoted RGM ]^, p. 520], and yet other times considered an example of an RGM 
p. 221, Eq. (6.17)]. The difference is that RGM is derived based (roughly) on the assumption that one 
starts at a feasible point, whereas SQP does not. 
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We begin by interpreting geometrically the Lagrangian function as it is used in constrained optimiza- 
tion. Consider the optimization problem 

(4.13) min f{x) given the constraint that h{x) = G R^. 

For simplicity we consider the case where the level surfaces h(x) = c are manifolds {dh/dx has full rank 
everywhere) and we work with the Euclidean metric. In the Euclidean case, the formulations are routine 
in the optimization community, but we have not seen the geometric intuition (particularly geometric 
interpretations away from the optimization point and the role that geodesies play "behind-the-scenes" ) in 
the optimization references that we have consulted. Numerical Lagrange multiplier issues are discussed 
in and ji^, for example. In this paper, we give the new interpretation that the Hessian of the 
Lagrangian is the correct matrix for computing second derivatives along geodesies at every point, not 
only as an approximation to the result at the optimal point. 

At every point x G R", it is possible to project the gradient of / onto the tangent space of the level 
surface through x. This defines a sort of flattened vector field. In terms of formulas, projection onto the 
tangent space (known as computing least-squares Lagrange multiplier estimates) means finding A that 
minimizes the norm of 

(4.14) U = f,-X-h,, 
i.e., 

(4.15) \^Uhl{Khl)-\ 

At every point x e R" (not only the optimal point) Lagrange multipliers are the coordinates of in the 
normal space to a level surface of the constraint, i.e., the row space of h^. (Our convention is that fx is 
a 1-by-n row vector, and is a p-hy-n matrix whose rows are the linearizations of the constraints.) 

If x{t) is any curve starting at x{Q) — x that is constrained to the level surface at x, then Lixi 
computes the derivative of / along the curve. (In other words, Jjx is the first covariant derivative.) The 
second derivative of / along the curve is 

(f 

(4.16) " i'^UxX + LxX. 

At the optimal point Jj^ is 0, and therefore Jj^x is a second order model for / on the tangent space to the 
level surface. The vanishing of the term involving Jj^ at the optimal point is well-known. 

The idea that we have not seen in the optimization literature and that we believe to be new is the 
geometrical understanding of the quantity at a non-optimal point: At any point at all, is tangent to 



the level surface while x(t) is normal when x is a geodesic. The second term in Eq. ( 4.16 ) conveniently 
vanishes here too because we are differentiating along a geodesic! Therefore, the Hessian of the Lagrangian 
has a natural geometrical meaning, it is the second derivative of / along geodesies on the level surface, 
i.e., it is the second covariant derivative in the Euclidean metric. 

We now describe the RG method geometrically. Starting at a point x on (or near) the constraint 
surface h{x) = 0, the quadratic function 



models / (up to a constant) along geodesies emanating from x. The x that minimizes this function is the 
Newton step for the minimum for /. Intrinsic Newton would move along the geodesic in the direction 
of i; a length equal to ||i||. Extrinsically, we can move along the tangent directly from a; to x -I- i and 
then solve a set of nonlinear equations to project back to the constraint surface. This is RGM. It is a 
static constrained Newton method in that the algorithm models the problem by assuming that the points 
satisfy the constraints rather than trying to dynamically move from level surface to level surface as does 
the SQR 
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In SQP, we start on some level surface. We now notice that the quadratic function 
(4.17) L^i + \x^\j^^x 

can serve as a model not only the first and second covariant derivative of / on the level surface through 
X but also on level surfaces for points near x. The level surface through x is specified by the equation 
hxX = 0. Other parallel level surfaces + c — Q. The right choice for c is h{x)^ which is a Newton 

step towards the level surface h{x) = 0. Therefore if the current position is x, and we form the problem of 
minimizing TjxX + ^x^~LxxX subject to the constraint that hxX + h{x) = 0, we are minimizing our model 
of / along geodesies through a level surface that is our best estimate for the constraint h{x) = 0. This is 
the SQP method. 

Practicalities associated with implementing these algorithms are discussed in the aforementioned 
texts. Generalizations to other metrics (non-Euclidean) are possible, but we do not discuss this in detail. 
Instead we conclude by making clear the relationship between Lagrange multipliers and the Christoffel 
symbols of differential geometry. 

To derive the geodesic equation, let f{x) = Xk, the kth coordinate of x. From Eq. ( [1.15| ), the Lagrange 
multipliers are hx^{hxhx)~^ ■ Since fxx = we then have that the geodesic equations are Xk — x^Tt^^x 
(fc = 1, . . . , n), where L^^ denotes, —h'^^{hxhx)~^ ■ hxx, the Hessian of the Lagrangian function of Xk- 
The matrix T'' = — L^^ is the Christoffel symbol of differential geometry. 



4.10. Eigenvalue Optimization. The geometric approach allows the formulation of sequential 
quadratic programming problems when the Lagrange multiplier formalism breaks down due to coordinate 
singularities. Specifically, the geometric insight from the previous subsection is that during the execution 
of a sequential quadratic program there are three types of directions. The first direction is towards the 
constraint manifold. SQP performs a Newton step in that direction. The second family of directions is 
parallel to the constraint manifold. SQP forms a quadratic approximation to the objective function in 
the parallel level surface obtained from the Newton step. The remaining directions play no role in an 
SQP and should be ignored. 

Consider the problem of minimizing the largest eigenvalue of A{x)^ an n-hy-n real symmetric matrix- 
valued function of x £ R™ when it is known that at the minimum, exactly p of the largest eigenvalues 
coalesce. Overton and Womersley formulated SQPs for this problem using Lagrange multipliers and 
sophisticated perturbation theory. The constraint in their SQP was that the p largest eigenvalues were 
identical. We will here consider the case of m > p{p + l)/2. One interesting feature that they observed 
was the non-differentiability of the largest eigenvalue at the optimum. Following the geometry of the 
previous section, a new algorithm without Lagrange multipliers may be readily devised. There will be no 
Lagrange multipliers because there will be no consideration of the third directions mentioned above. 

We will write A for A{x). Let A — Y^AY, where the orthonormal columns of Y span the invariant 
subspace for thep largest eigenvalues of A, Ai, . . . , Ap. We let F{A) — Ai and L(A) = tr(A) = Ai + - • ■ + Xp. 
Unlike the function F{A), L(A) is a differentiablc function at the optimal point. One might have guessed 
that this L(yl) was the right L(A), but here is how one can logically deduce it. 

The trick is to rely not on the Lagrange multiplier formalism of constraint functions, but rather on 
the geometry. Geometry has the power to replace a long complicated derivation with a short powerful 
one. Once the techniques are mastered, geometry provides the more intuitive understanding. There 
is no convenient h{A) to express the constraint of multiple eigenvalues; artificially creating one leads 
to unnecessary complications due to the coordinate singularity when one moves from the level surface 
h{A ) ~ to another level surface. The right way to understand the coordinate singularity was described 



in § [4.2[ . The direction of the Newton step must be the first order constraint of the coallescing of the 
eigenvalues. Using the notation of §L2, the parallel directions are the tangent vectors of Sn,p- AH other 



directions play no role. The natural level surfaces are thereby obtained by shifting the p largest eigenvalues 
by a constant, and developing the orthogonal eigenvector matrix Q{0) as in Eq. ( 2.32| ). 
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The message from § [4. 9] is that whatever function we are interested in, we are only interested in the 
component of the gradient in the direction parallel to Sn,p- The very construction of a Lagrangian L then 
may be viewed as the construction of an appropriate function with the property that is parallel to the 



tangent vectors of Sn,p- Of course the tangent space to Sn,p (see §4.2) includes projection matrices of the 
form X^iLi onyiVi ^ where yi is the eigenvector corresponding to Ai, only when the Ui are all equal. This cor- 
responds to an identical shift of these eigenvalues. Therefore to form the correct gradient of the objective 
function F{A) — Ai everywhere, we should replace the true gradient, which is well known to be the spectral 
projector yiyj , with its component in the direction YY"'" , which is an Sn,p tangent vector. Integrating, we 

now see that the act of forming the Lagrangian, which we now understand geometrically to mean replacing 
Uiu'i with (projecting the gradient to the surface of uniform shifts) amounts to nothing more than 

changing the objective function from F{x) to L(x) — tr(A) — ti- Y'^AY. While one might have guessed 
that this was a convenient Langrangian, we deduced it by projecting the gradient of f{x) on the tangent 
space of a level surface. The components of f{x) that we removed implicitly would have contained the 
Lagrange multipliers, but since these components are not well defined at the coordinate singularity, it is 
of little value to be concerned with them. 

Now we must explicitly consider the dependence of L on x. Our optimization step is denoted Ax, 
and A and A respectively denote [A^jAa;] and [A^^^AxAx] (notation from |^). It is easy to verify that 

(4.18) L^^tvY^AY, 

(4.19) = tr(r'^ir + y'^Ay + y'^Ay), 

where Y is the solution to 

(4.20) YA-{I - YY^)AY = (/ - YY'^)Ay 

that satisfies Y'^Y = 0. The resulting sequential quadratic program over Aa; is then 

(4.21) min 

subject to the linear constraint (on Ax) that 

(4.22) Y^Ay + A = aI, 

where the scalar a is arbitrary. 

Let us explain all of these steps in more detail. The allowable Y are Grassmann directions, Y^Y = 0. 
Otherwise, we are not parallel to the constraint surface. Equation ( 4.18|) i s the derivative of Y^AY. 



Noting that AY = yA and Y^Y = 0, two terms disappear. Equation (4.19) is trivial but we note the 



problem that we do not have an explicit expression for Y, we only have A, Y and A. Fortunately, the 



perturbation theory for the invariant subspace is available from Equation ( 4.2C ) . It m ay be derived by 
differentiating AY = YA and substituting A — Y^Ay^ The solution to Equation ( 4.20 ) is unique so long 
as no other eigenvalue of A is equal to any of Ai, . . . , Ap. 

The linear constraint on Ax is the one that infinitesimally moves us to the constraint surface. It is 
the condition that moves us to a diagonal matrix. Therefore, A = Y^AY when added to A must be a 
scalar multiple of the identity. This is a linear condition on A and therefore on Ax. The a does not 
explicitly appear in the constraint. 

5. Conclusions. This paper offers a new approach to the algorithms in numerical analysis involving 
orthogonality constraints. We have found that these algorithms should be understood as optimization 
algorithms in the correct geometrical setting; however, they rarely are. 



Alert readers may notice that this is really the operator used in the definition of "sep" in numerical linear algebra texts. 
The reader really understands the theory that we have developed in this paper if he or she now can picture the famous "sep" 
operator as a Lie bracket with a Grassmann tangent and is convinced that this is the "right" way to understand "sep" . 



Orthogonality Constraints 



39 



Newton on the 

Grassmann 

Manifold 



PCG or — 
Approximate 
Newton on the 
Grassmann 
Manifold 



CG on the 

Grassmann 

Manifold 



SD on the 

Grassmann 
Manifold 



Block Rayleigh 
Quotient 



RQI 



Newton Subspace 
Improvement 

Demmel 87 
Chatelin 84,93 

Dongarra,Moler,Wilkinson 83 
Stewart 73 

Nonlinear PCG 

Gillan 89 

Arias 92 

Payne, Teter, Allan 92 

BIk Inv Iteration — T 
Block PCG 

Alsen 71 



i 



t Nonlinear CG 

Linear Eigenvalue 
CG 

Sameh, Wisniewski 82 
Fu, Dowling 95 



Inv Iteration 
Davidson 
PCG 



Gradient Flows- 



Perdon, Gamb.89 

Hessian 

Geradin 71 

PR or FR 

Bradbury, Flet.66 
Fox, Kapoor 69 

Fried 69, 72 
Anderson 71 

Haimi-cohen 
Cohen 87 

Ruhe,74 
Yang, Sarkar, 
Arvas 89 
Fuhrmann, Liu 84 

Conjugate 
through A 

Chen, Sarkar 86 

Power 
Method 



Fig. 4.2. Taxonomy of algorithms defined from the Grassmann manifold. 



As a concluding example of the insight gained, wc propose a Grassmann based taxonomy for problems 
related to the symmetric eigenproblem. This taxonomy allows us to view algorithms not as isolated 
entities, but as objects with a coherent mathematical structure. It is our hope that developers of new 
algorithms and perturbation theories will benefit from the analytical approach that lead to our taxonomy. 
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In this taxonomy, algorithms are viewed as either restrictions or approximations of their parent. Ulti- 
mately, we have Newton's method on arbitrary Riemannian manifolds as the root. One can then restrict 



to a particular manifold such as the Stiefel manifold or, as we illustrate in Figure 4.2, the Grassmann 
manifold. Along the vertical axis in the left column we begin with Newton's method which may be ap- 
proximated first with PCG or approximate Newton methods, then pure conjugate gradient, and finally 
steepest descent. Moving from left to right the idealized algorithms are replaced with more practical 
versions that specialize for particular problems. The second column contains block algorithms, while 
the third contains single eigenvector related algorithms. This abstraction would not be possible without 
geometry. 
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Nonlinear CG 



►Linear Eigenvalue CG 

Sameh, Wisniewski 82 
Fu, Dowling 95 



-•Gradient Flows 



•Inverse Iteration 
•Davidson's Method 
PCG 

Perdon, Gambolati 89 
•Hessian Based 

Geradin 71 
^PR or FR Based 

Bradbury, Fletcher 66 
Fox, Kapoor 69 
Fried 69, 72 
Anderson 71 
Haimi-Cohen, Cohen 87 
Ruhe 74 

Yang, Sarkar, Arvas 89 
Fuhrmann, Liu 84 
•Conjugate through A 

Chen, Sarkar 86 
wPower Method 
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